
Run DeepSeek-R1 Locally: Complete Tutorial for 671B Parameter Model


AI Tools Team


DeepSeek-R1 has taken the AI world by storm. With 671 billion parameters and reasoning capabilities comparable to OpenAI's o1, it's one of the most powerful open-weight models ever released. The best part? You can run it, or one of its distilled variants, locally, completely free, with zero data leaving your machine.

This tutorial walks you through every step of deploying DeepSeek-R1 on your own hardware.

TL;DR

  • DeepSeek-R1 is a 671B parameter reasoning model available in quantized versions
  • Use Ollama for command-line deployment or LM Studio for a GUI experience
  • Minimum requirements: 8GB RAM for the 7B distill, 16GB+ for larger versions
  • Apple Silicon Macs offer the best consumer experience due to unified memory
  • The model excels at coding, math, and complex reasoning tasks

Why Run DeepSeek-R1 Locally?

Complete Privacy

When you use cloud AI services, your prompts are logged, analyzed, and potentially used for training. Running locally means:

  • No data transmission to external servers
  • No usage tracking or logging
  • Full control over your conversations

Zero Cost

DeepSeek-R1 is open-weight. Once downloaded, you own it forever. No subscriptions, no API fees, no usage limits.

Uncensored Responses

Cloud services layer their own moderation and filtering on top of the model. Running locally gives you the model's raw output, with no provider-level guardrails added on top; the only alignment left is what's baked into the weights themselves.

Hardware Requirements

DeepSeek-R1 comes in several sizes. Here’s what you need:

| Model Version | Parameters | RAM Required | Best For |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | 4GB | Mobile, testing |
| DeepSeek-R1-Distill-Qwen-7B | 7B | 8GB | Basic tasks |
| DeepSeek-R1-Distill-Qwen-14B | 14B | 16GB | General use |
| DeepSeek-R1-Distill-Qwen-32B | 32B | 24GB | Advanced reasoning |
| DeepSeek-R1-Distill-Llama-70B | 70B | 48GB | Near full capability |
| DeepSeek-R1 (Full) | 671B | 400GB+ | Enterprise/research |
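
These figures follow from a simple rule of thumb: at 4-bit quantization each parameter takes roughly half a byte, plus a few gigabytes of overhead for the runtime and context window (the table adds extra headroom on top of that). Here's a rough back-of-the-envelope sketch in Python; the overhead value is an assumption, not a measured number:

# Rough RAM estimate for a 4-bit quantized model: weights plus runtime/context overhead
def approx_ram_gb(params_billion: float, bits_per_weight: int = 4, overhead_gb: float = 2.0) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 4 bits is about 0.5 GB
    return weight_gb + overhead_gb

for size in (1.5, 7, 14, 32, 70):
    print(f"{size}B at 4-bit: ~{approx_ram_gb(size):.0f} GB")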

Platform Recommendations

Mac (Apple Silicon): The best consumer option. M2/M3/M4 chips with unified memory can run larger models than equivalent PC setups. A MacBook Pro with 32GB RAM can comfortably run the 32B version.

Windows/Linux (NVIDIA GPU): You need a dedicated GPU with sufficient VRAM. RTX 4090 (24GB) handles the 32B model well. For larger models, consider multi-GPU setups.

CPU Only: Possible but painfully slow. Expect 1-2 tokens per second on the 7B model.

Method 1: Using Ollama (Command Line)

Ollama is the simplest way to run local LLMs. It handles model downloads, quantization, and serving automatically.

Step 1: Install Ollama

Mac/Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com/download.

Step 2: Pull the DeepSeek-R1 Model

Open your terminal and run:

# For the 7B distilled version (recommended starting point)
ollama pull deepseek-r1:7b

# For the 14B version (better quality)
ollama pull deepseek-r1:14b

# For the 32B version (best quality for consumer hardware)
ollama pull deepseek-r1:32b

The download will take several minutes depending on your connection. The 7B model is approximately 4.5GB.
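
Once the pull finishes, ollama list shows every installed model along with its size on disk, which is a quick way to confirm the download completed.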

Step 3: Run the Model

Start an interactive chat session:

ollama run deepseek-r1:7b

You’ll see a prompt where you can type your questions:

>>> What is the time complexity of quicksort?

The average time complexity of quicksort is O(n log n), while the worst-case 
complexity is O(n²). The worst case occurs when the pivot selection consistently 
results in highly unbalanced partitions...
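
If you'd rather script the model than type into the interactive prompt, the official ollama Python client can drive the same local model. A minimal sketch, assuming you've installed it with pip install ollama and already pulled deepseek-r1:7b:

import ollama

# Send a single chat turn to the locally pulled model and print the reply
response = ollama.chat(
    model="deepseek-r1:7b",
    messages=[{"role": "user", "content": "What is the time complexity of quicksort?"}],
)
print(response["message"]["content"])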

Step 4: Configure for Better Performance

Create a custom Modelfile for optimized settings:

# Create a file named Modelfile
cat << 'EOF' > Modelfile
FROM deepseek-r1:14b

PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER num_gpu 99

SYSTEM "You are a helpful AI assistant specialized in coding and technical analysis."
EOF

# Create the custom model
ollama create deepseek-custom -f Modelfile
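
You can then start a session with your customized model by running ollama run deepseek-custom; it loads the 14B weights with the temperature, context length, and system prompt you set above.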

Step 5: Use with API

Ollama exposes a REST API for integration with other tools:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "Explain the difference between TCP and UDP",
  "stream": false
}'
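
The same endpoint is just as easy to call from Python. A minimal sketch using the requests library (any HTTP client works; 11434 is Ollama's default port):

import requests

# One-shot completion against the local Ollama server; stream=False returns a single JSON object
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:14b",
        "prompt": "Explain the difference between TCP and UDP",
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])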

Method 2: Using LM Studio (GUI Option)

LM Studio provides a visual interface for those who prefer not to use the command line.

Step 1: Download LM Studio

Get it from lmstudio.ai. Available for Mac, Windows, and Linux.

Step 2: Search for DeepSeek-R1

  1. Open LM Studio
  2. Click the “Search” tab
  3. Type “DeepSeek-R1” in the search bar
  4. Browse available quantizations

Step 3: Choose the Right Quantization

You’ll see options like Q4_K_M, Q5_K_M, Q8_0. Here’s what they mean:

| Quantization | Quality | Size | Speed |
|---|---|---|---|
| Q4_K_M | Good | Smallest | Fastest |
| Q5_K_M | Better | Medium | Medium |
| Q8_0 | Best | Largest | Slowest |

For most users, Q4_K_M offers the best balance.

Step 4: Download and Load

  1. Click the download button next to your chosen model
  2. Wait for the download to complete
  3. Go to the “Chat” tab
  4. Select your downloaded model from the dropdown
  5. Start chatting!

Step 5: Adjust Settings

In the right sidebar, you can configure:

  • Context Length: How much text the model remembers (default: 4096)
  • Temperature: Creativity level (0.0 = deterministic, 1.0 = creative)
  • GPU Layers: How much of the model to load into GPU memory

Method 3: Using vLLM (Advanced)

For production deployments or maximum performance, vLLM offers optimized inference.

Step 1: Install vLLM

pip install vllm

Step 2: Start the Server

python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
    --tensor-parallel-size 1 \
    --max-model-len 8192
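
Here --tensor-parallel-size is the number of GPUs the weights are sharded across (1 for a single card), and --max-model-len caps the context window, which in turn bounds how much memory the KV cache can consume.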

Step 3: Query via OpenAI-Compatible API

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    messages=[
        {"role": "user", "content": "Write a Python function to find prime numbers"}
    ]
)

print(response.choices[0].message.content)

Integrating with Your Workflow

VS Code Integration

Use the Continue extension to connect DeepSeek-R1 as your coding assistant:

  1. Install Continue from the VS Code marketplace
  2. Open Continue settings
  3. Add Ollama as a provider:
{
  "models": [
    {
      "title": "DeepSeek-R1 Local",
      "provider": "ollama",
      "model": "deepseek-r1:14b"
    }
  ]
}
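
Save the config, then pick "DeepSeek-R1 Local" from Continue's model selector; chat and code edits should now run entirely against your local Ollama server.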

Now you have a free, private Copilot alternative. For more on AI coding tools, check our Cursor vs Copilot comparison.

Terminal Integration

Add an alias to your shell configuration:

# Add to ~/.zshrc or ~/.bashrc
alias ask='ollama run deepseek-r1:14b'

Now you can quickly query the model:

ask "How do I reverse a linked list in Python?"

Performance Optimization Tips

1. Use GPU Offloading

Ensure Ollama is using your GPU:

# Check GPU usage
ollama ps

# Force more layers onto the GPU by setting the num_gpu parameter in a
# Modelfile (see Step 4) or by passing it as an API option (see the sketch below)
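
If you drive the model through the API instead of the CLI, the layer-offload setting can be passed per request. A hedged sketch using the ollama Python client, where the option names mirror the Modelfile parameters from Step 4:

import ollama

# Ask Ollama to offload as many layers as possible to the GPU for this request
response = ollama.generate(
    model="deepseek-r1:14b",
    prompt="Summarize the CAP theorem in two sentences.",
    options={"num_gpu": 99, "num_ctx": 4096},
)
print(response["response"])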

2. Adjust Context Length

Longer context uses more memory. If you're running out of RAM, lower num_ctx from inside an interactive session (or via a Modelfile PARAMETER):

ollama run deepseek-r1:14b
>>> /set parameter num_ctx 2048

3. Monitor Resource Usage

On Mac:

# Watch memory usage
watch -n 1 'memory_pressure'

On Linux:

# Watch GPU usage
watch -n 1 nvidia-smi

4. Use Quantized Models

If performance is slow, try a more aggressively quantized build. The exact tag names are listed on the deepseek-r1 page in the Ollama library, for example:

# Example tag; check ollama.com/library/deepseek-r1 for the quantizations actually published
ollama pull deepseek-r1:7b-q4_0

Troubleshooting Common Issues

"Out of Memory" Errors

Solution: Use a smaller model or reduce context length.

# Try the 7B version instead of 14B, and lower the context window once it loads
ollama run deepseek-r1:7b
>>> /set parameter num_ctx 2048

Slow Generation Speed

Causes:

  • Model too large for available RAM
  • GPU not being utilized
  • Running on CPU only

Solutions:

  1. Check GPU is detected: ollama ps
  2. Use a smaller quantization
  3. Close other memory-intensive applications

Model Not Found

Solution: Pull the model first:

ollama pull deepseek-r1:7b

Garbled or Repetitive Output

Solution: Lower the temperature and raise the repetition penalty from inside the session:

ollama run deepseek-r1:14b
>>> /set parameter temperature 0.7
>>> /set parameter repeat_penalty 1.1

DeepSeek-R1 vs Other Local Models

How does DeepSeek-R1 compare to alternatives?

| Model | Reasoning | Coding | Speed | Size |
|---|---|---|---|---|
| DeepSeek-R1 (14B) | Excellent | Excellent | Medium | 8GB |
| Llama 3.1 (8B) | Good | Good | Fast | 4GB |
| Mistral (7B) | Good | Medium | Fast | 4GB |
| Qwen 2.5 (14B) | Very Good | Very Good | Medium | 8GB |

DeepSeek-R1 excels at complex reasoning and coding tasks. For lightweight general chat, Llama 3.1 will be faster. For a deeper comparison, see our DeepSeek vs ChatGPT coding benchmark.

What Can You Build?

With DeepSeek-R1 running locally, you can:

  1. Private Code Assistant: Review and generate code without sending proprietary code to the cloud
  2. Document Analyzer: Process sensitive documents locally
  3. Research Assistant: Analyze papers and data privately
  4. Learning Tool: Practice coding with instant feedback

For more ideas on local AI setups, check our guide to running local LLMs.

Conclusion

Running DeepSeek-R1 locally is easier than ever. With tools like Ollama and LM Studio, you can have a distilled version of a frontier reasoning model running on your laptop in minutes.

The key steps:

  1. Choose your method: Ollama for CLI, LM Studio for GUI
  2. Select the right model size: Start with 7B, upgrade as needed
  3. Optimize for your hardware: Adjust context length and quantization
  4. Integrate into your workflow: VS Code, terminal aliases, or API

Privacy, zero cost, and full control. That’s the promise of local AI, and DeepSeek-R1 delivers.


Ready to explore more AI tools? Check out our AI Agents Guide for 2026 to see how local models can power autonomous assistants.

#deepseek #local-llm #ollama #lm-studio #privacy #tutorial

About AI Tools Team

The official editorial team of AI Tools.