
ENG | Running large language models locally.

Running AI models like DeepSeek-R1 locally using Ollama and exploring their practical limitations.


Motivation

This article is based on my brief experiments with running AI locally and exploring its practical limitations. I wanted to avoid subscription fees during weeks with heavier AI use.

Right now there’s hype around DeepSeek-R1, which is supposed to beat ChatGPT-4 and Claude 3.5 Sonnet. The best part is that it’s free to download and you can run it on your own hardware… or can you?

Before you start: Reality check

Have realistic expectations when downloading a model. Models around 14B to 16B parameters are the practical maximum for a 12GB GPU. These will run at decent speeds even on a CPU - I tested on an old i5-6500T and it was usable. Moving up to 30B models changes everything - they run about four times slower and need roughly 23GB of RAM. Even on a Ryzen 5900X, the speed isn’t practical for regular use, although this varies by specific model - some are faster than others.

TL;DR: DeepSeek-R1:671b is out of the question for your pathetic PC.
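
As a quick way to guess what fits, here is the back-of-the-envelope rule of thumb I use (my own approximation, not an exact formula): a 4-bit quantized model takes roughly 0.6 bytes per parameter, plus a few gigabytes for context and runtime overhead. Any shell with awk will print the estimates:

    # Rough memory estimate for 4-bit quantized models (rule of thumb, not exact):
    # ~0.6 bytes per parameter plus ~3 GB for context and runtime overhead.
    awk 'BEGIN {
        n = split("8 14 32 70 671", sizes, " ")
        for (i = 1; i <= n; i++)
            printf "%4sB model: ~%.0f GB\n", sizes[i], sizes[i] * 0.6 + 3
    }'

By that estimate a 32B model lands right around the 23GB mentioned above, and the 671B DeepSeek-R1 needs well over 400GB.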

Windows

  • Get Ollama - roughly an 800MB download - and install it.

  • Open a new command prompt and type ollama to see the available commands:
    PS C:\Users\pavel> ollama
    Usage:
      ollama [flags]
      ollama [command]
    
    Available Commands:
      serve       Start ollama
      create      Create a model from a Modelfile
      show        Show information for a model
      run         Run a model
      stop        Stop a running model
      pull        Pull a model from a registry
      push        Push a model to a registry
      list        List models
      ps          List running models
      cp          Copy a model
      rm          Remove a model
      help        Help about any command
    
    Flags:
      -h, --help      help for ollama
      -v, --version   Show version information
    
    Use "ollama [command] --help" for more information about a command.
    
  • Pull and run a model; start with something small for testing:
    PS C:\Users\pavel> ollama run deepseek-r1:8b
    pulling manifest
    pulling 6340dc3229b0...  87% ▕████████████████████████████████████████████████        ▏ 4.3 GB/4.9 GB   61 MB/s     10s
    

    One minute later:

    pulling manifest
    pulling 6340dc3229b0... 100% ▕████████████████████████████████████████████████████████▏ 4.9 GB
    pulling 369ca498f347... 100% ▕████████████████████████████████████████████████████████▏  387 B
    pulling 6e4c38e1172f... 100% ▕████████████████████████████████████████████████████████▏ 1.1 KB
    pulling f4d24e9138dd... 100% ▕████████████████████████████████████████████████████████▏  148 B
    pulling 0cb05c6e4e02... 100% ▕████████████████████████████████████████████████████████▏  487 B
    verifying sha256 digest
    writing manifest
    success
    
  • Interact with the model. Once the model is running, you can chat with it directly and use the provided commands:
    >>> /?
    Available Commands:
      /set            Set session variables
      /show           Show model information
      /load <model>   Load a session or model
      /save <model>   Save your current session
      /clear          Clear session context
      /bye            Exit
      /?, /help       Help for a command
      /? shortcuts    Help for keyboard shortcuts
    
    Use """ to begin a multi-line message.
    
    >>> Hello
    <think>
    
    </think>
    
    Hello! How can I assist you today? 😊
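
  • Housekeeping (optional). The models take several gigabytes each, so a few of the commands from the help listing above are handy for cleanup:

    ollama list                   # downloaded models and their size on disk
    ollama ps                     # models currently loaded in memory
    ollama stop deepseek-r1:8b    # unload a running model from memory
    ollama rm deepseek-r1:8b      # remove the model files to free disk space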
    

Linux

On Linux there’s a tarball you can download. It contains the bin/ollama binary and CUDA libraries under lib/ollama/.

I created a new user named ollama and unpacked the tarball into their home directory. Then you need to run ~/bin/ollama serve & to start the server in the background, followed by ~/bin/ollama run codestral:22b or something similar.
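
Condensed into commands, the manual setup looks roughly like this (a sketch of the steps above; the archive name depends on the release you actually downloaded):

    # Sketch only - adjust the archive name to the release you downloaded.
    sudo useradd -m ollama
    sudo tar -C /home/ollama -xzf ollama-linux-amd64.tgz
    sudo chown -R ollama:ollama /home/ollama
    sudo -u ollama /home/ollama/bin/ollama serve &
    sudo -u ollama /home/ollama/bin/ollama run codestral:22b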

This is more complicated than just running the provided install script, but that script leaves the server running in the background all the time, which I wanted to avoid.

Alternatively, you can use the systemd unit file provided in the official manual install instructions.
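
That unit looks roughly like the sketch below; here the ExecStart path is adapted to the home-directory install described above, so double-check it against the official file:

    # /etc/systemd/system/ollama.service - a sketch adapted from the official
    # manual install instructions, with ExecStart pointing at the home-directory
    # install above; verify against the official unit file.
    [Unit]
    Description=Ollama Service
    After=network-online.target

    [Service]
    ExecStart=/home/ollama/bin/ollama serve
    User=ollama
    Group=ollama
    Restart=always
    RestartSec=3

    [Install]
    WantedBy=default.target

After dropping the file into /etc/systemd/system/, sudo systemctl daemon-reload followed by sudo systemctl enable --now ollama keeps the server running as a service.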

Conclusion

Running AI models locally is impractical on PCs made for us mortals. It’s not common to have a GPU with 32GB of RAM, let alone one with 512GB or more 😢. Yes, there are small models, but…

The real problem with small models goes beyond memory constraints. They suffer from a severe identity crisis and a lack of expertise, which can sometimes be fun but is mostly frustrating. One moment the AI presents itself as a personal assistant that can’t code but claims it can play music, arrange appointments, and set notifications (none of which it can actually do). After resetting the session, it suddenly “forgets” its limitations and writes code full of duplicate arrays and unused variables. When you point out these issues, it can barely guess what its own program does and makes vague speculations about variables “possibly being for future use.” Then a few paragraphs later, it completely forgets which code you were even discussing.

Side notes

The Economics of AI Cloud Services

What makes this whole situation even more interesting is the return-on-investment calculation for cloud-based AI. NVIDIA makes boards with 1.5TB of RAM, a 6kW power draw, and a $400,000 price tag. Just to pay back the hardware in 3 years - no electricity, cooling, maintenance, R&D, support staff, or redundancy systems - such a board needs to bring in over $11,000 per month. Call it 700 users paying a $20 monthly subscription ($14,000), and each of those users can only utilize the server for about two minutes a day, or one hour a month. When you look at these numbers, the current pricing of cloud AI services starts making more sense, even if it feels expensive for individual users.
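
A quick sanity check of those numbers (hardware cost only, assuming 700 subscribers sharing one board around the clock):

    # Hardware-only payback: $400,000 over 36 months, $20/month subscriptions,
    # 700 users sharing one board 24 hours a day.
    awk 'BEGIN {
        monthly = 400000 / 36
        printf "needed per month     : $%.0f\n", monthly
        printf "users at $20/month   : %.0f\n", monthly / 20
        printf "minutes/day per user : %.1f  (700 users sharing 24 hours)\n", 24 * 60 / 700
    }'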

DeepSeek and Media Double Standards

In late January 2025, DeepSeek-R1 had an interesting effect. First it was hailed as breakthrough technology, claimed to have been trained at a fraction of the cost of other models while surpassing them in benchmarks. Then came various articles spreading how unethical DeepSeek is: how it was trained on outputs from ChatGPT, Claude.AI and the like, how it refuses to answer topics sensitive to China, how it steals your data.

Particularly hypocritical was the stance of OpenAI, which accused DeepSeek of using its outputs to train their own model, in violation of its terms of service. Two or three years earlier, it was OpenAI scraping content all over the internet, copyrighted or not, just to achieve technological leadership in the name of innovation and American dominance. Not to mention that DeepSeek’s models and some of its papers are public, whereas there’s nothing open about OpenAI.

Nonetheless, the media coverage clearly shows a double standard and a high level of bias.

Addendum (2025-02-02)

qwen2.5-coder:32b is actually quite a good model from Alibaba, especially compared to Codestral, which is not far from answering that it’s a toaster that can’t code. The full cloud model also seems pretty capable.

A week after the release of DeepSeek-R1, OpenAI released the new o3-mini model and Qwen released Qwen-2.5-Max, so suddenly there are more options.

Nonetheless, at this point they all seem similar in capabilities; nothing stands out, and personal impressions can easily be skewed by the randomness of the output.


This post is licensed under CC BY 4.0 by the author.