Running large language models locally
Running AI models like DeepSeek-R1 locally using Ollama and exploring their practical limitations.
Motivation
This article is based on my brief experiments with running AI locally and exploring its practical limitations. I wanted to avoid subscription fees during weeks with heavier AI use.
Right now there's a lot of hype around DeepSeek-R1, which is supposed to beat GPT-4 and Claude 3.5 Sonnet, and the best part is that it's free to download and you can run it on your own hardware… or can you?
Before you start: Reality check
Have realistic expectations when downloading models. Models around 14B to 16B parameters are the practical maximum for a 12 GB GPU. These will run at decent speeds even on a CPU - I tested on an old i5-6500T and it was usable. Moving up to 30B models changes everything - they run about four times slower and need roughly 23 GB of RAM. Even on a Ryzen 5900X, the speed isn't practical for regular use. Although this varies by specific model - some are faster than others.
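To get a feel for why, here's a back-of-envelope estimate - my rough rule of thumb, not an official formula: a quantized model needs about parameters × bits-per-weight / 8 bytes of memory, plus a couple of GB for context, and typical Ollama downloads are quantized to roughly 4-5 bits per weight.

```
 14B × 5 bits / 8 ≈   9 GB + ~2 GB context ≈ 11 GB   → fits a 12 GB GPU
 30B × 5 bits / 8 ≈  19 GB + ~2 GB context ≈ 21 GB   → spills into system RAM
671B × 5 bits / 8 ≈ 420 GB                           → forget it
```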
TL;DR: DeepSeek-R1:671b is out of the question for your pathetic PC.
Windows
Get Ollama - roughly an 800 MB download - and install it.
- Open a new command prompt and type `ollama` to see the available commands:

```
PS C:\Users\pavel> ollama
Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.
```
- Pull and run a model; start with something small for testing:
```
PS C:\Users\pavel> ollama run deepseek-r1:8b
pulling manifest
pulling 6340dc3229b0...  87% ▕████████████████████████████████████████████████ ▏ 4.3 GB/4.9 GB   61 MB/s    10s
```
One minute later:
```
pulling manifest
pulling 6340dc3229b0... 100% ▕████████████████████████████████████████████████████████▏ 4.9 GB
pulling 369ca498f347... 100% ▕████████████████████████████████████████████████████████▏  387 B
pulling 6e4c38e1172f... 100% ▕████████████████████████████████████████████████████████▏ 1.1 KB
pulling f4d24e9138dd... 100% ▕████████████████████████████████████████████████████████▏  148 B
pulling 0cb05c6e4e02... 100% ▕████████████████████████████████████████████████████████▏  487 B
verifying sha256 digest
writing manifest
success
```
- Interact with the model. Once the model is running, you can talk to it using the built-in commands:
```
>>> /?
Available Commands:
  /set            Set session variables
  /show           Show model information
  /load <model>   Load a session or model
  /save <model>   Save your current session
  /clear          Clear session context
  /bye            Exit
  /?, /help       Help for a command
  /? shortcuts    Help for keyboard shortcuts

Use """ to begin a multi-line message.

>>> Hello
<think>

</think>

Hello! How can I assist you today? 😊
```
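The interactive prompt isn't the only way in: the Ollama server also exposes a local HTTP API on port 11434, which is handy for scripting. A minimal sketch using the model pulled above - note that older PowerShell aliases `curl` to its own cmdlet, so call `curl.exe` explicitly there:

```
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:8b",
  "prompt": "Hello",
  "stream": false
}'
```

With `"stream": false` the answer comes back as a single JSON object with the text in its `response` field; leave streaming on to receive it token by token instead.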
Linux
On Linux there's a tarball you can download. It contains the `bin/ollama` binary and CUDA libraries in the `lib/ollama/*` directory.
I created a new user named `ollama` and unpacked the tarball in their home directory. Then you need to run `~/bin/ollama serve &` to start the server in the background, followed by `~/bin/ollama run codestral:22b` or something similar.
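Spelled out as commands - a sketch of what I did, assuming the current amd64 tarball name from ollama.com (check the manual install instructions for the exact URL):

```
sudo useradd -m ollama
sudo -iu ollama
curl -LO https://ollama.com/download/ollama-linux-amd64.tgz
tar xzf ollama-linux-amd64.tgz    # unpacks bin/ollama and lib/ollama/
~/bin/ollama serve &              # server in the background
~/bin/ollama run codestral:22b    # talk to a model
```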
I deliberately chose a more complicated route than running the provided install script, which would leave the server running in the background all the time.
If you do want a persistent service, you can use the systemd unit file provided in the official manual install instructions.
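For reference, a minimal unit in that spirit - my own sketch with paths adjusted for the home-directory layout above, not the official file (which expects the tarball unpacked under /usr):

```
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/home/ollama/bin/ollama serve
User=ollama
Group=ollama
Restart=always

[Install]
WantedBy=default.target
```

After that, `sudo systemctl daemon-reload && sudo systemctl enable --now ollama` starts the server and keeps it across reboots.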
Conclusion
Running AI models locally is impractical on the PCs made for us mortals. It's not common to have a GPU with 32 GB of RAM or so, let alone one with 512 GB or more 😢. Yes, there are small models, but…
The real problem with small models goes beyond memory constraints. They suffer from a severe identity crisis and a lack of expertise, which can sometimes be fun but is mostly frustrating. One moment the AI presents itself as a personal assistant that can't code but claims it can play music, arrange appointments, and set notifications (none of which it can actually do). After a session reset it suddenly "forgets" its limitations and writes code full of duplicate arrays and unused variables. When you point out these issues, it can barely guess what its own program does and offers vague speculation about variables "possibly being for future use." A few paragraphs later, it completely forgets which code you were even discussing.
Side notes
The Economics of AI Cloud Services
What makes this whole situation even more interesting is the return-on-investment calculation for cloud-based AI. NVIDIA makes boards with 1.5 TB of RAM, 6 kW power consumption, and a $400,000 price tag. Just to pay off the hardware in three years - ignoring electricity, cooling, maintenance, R&D, support staff, and redundancy - such a board has to earn roughly $11,000 a month. Seven hundred users paying a $20 monthly subscription bring in $14,000, which covers that with only a modest margin - and each of those users can then occupy the server for only about two minutes a day, or one hour a month. When you look at these numbers, the current pricing of cloud AI services starts making more sense, even if it feels expensive for individual users.
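The arithmetic, spelled out:

```
$400,000 / 36 months  ≈ $11,100 per month    hardware amortization alone
700 users × $20       =  $14,000 per month   revenue at a typical subscription price
(24 × 60) min / 700   ≈  2 minutes per day   server time per user, about 1 hour a month
```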
DeepSeek and Media Double Standards
In late January 2025, DeepSeek-R1 had an interesting effect. At first it was hailed as breakthrough technology: it was claimed to have been trained at a fraction of the cost of other models while surpassing them in benchmarks. Then came a wave of articles about how unethical DeepSeek is: it was trained on outputs from ChatGPT, Claude, and the like; it refuses to answer questions on topics sensitive to China; it steals your data.
Particularly hypocritical was the stance of OpenAI, which accused DeepSeek of using its outputs to train a competing model, in violation of its terms of service. Two or three years earlier, it was OpenAI scraping content all over the internet, copyrighted or not, to achieve technological leadership in the name of innovation and American dominance. Not to mention that the DeepSeek models and some papers are public, whereas there's nothing open about OpenAI.
Nonetheless, the media coverage clearly shows a double standard and a high level of bias.
Addendum (2025-02-02)
`qwen2.5-coder:32b` is actually quite a good model from Alibaba, especially compared to Codestral, which is not far from answering that it's a toaster that can't code. The full cloud model also seems pretty capable.
A week after the release of DeepSeek-R1, OpenAI released the new o3-mini model and Alibaba released Qwen2.5-Max, so suddenly there are more options.
Nonetheless, at this point they all seem similar in capability; nothing stands out, and personal impressions can easily be skewed by the randomness of the output.
Resources
- ollama.com - downloads for Windows and Linux, manual setup guide.
- Fireship -=- I built a DeepSeek R1 powered VS Code extension - YouTube video on how to install Ollama (and more).
- Fireship -=- DeepSeek stole our tech… says OpenAI - YouTube video about DeepSeek-R1.
- NVIDIA Umbriel B200 Baseboard 1.5TB HBM3e - GPU baseboard for the largest AI models.
- ServeTheHome -=- Inside the SUPER NVIDIA H200 Server From Supermicro - YouTube video about a GPU server with a board similar to the above; quite fascinating.