Local LLMs for Beginners — First Steps

Start here

Running a language model on your own computer sounds technical. It does not have to be. You need three decisions: which app to use, which model file to download, and whether your machine has enough memory to run it comfortably.

This guide is for beginners — people who opened LM Studio, saw fifty versions of the same model, and wondered what GGUF (GPT-Generated Unified Format), MLX (Apple’s machine learning framework), Q4 (4-bit quantization), and “full GPU offload” actually mean. By the end you will know what to install first and what to ignore.

What “local LLM” really means

A local LLM (large language model) is an AI model that runs on your laptop or desktop instead of on a cloud service like ChatGPT. You download the model once, then chat offline. Your prompts stay on your machine unless you choose to send them somewhere else.

Two free apps dominate this space: LM Studio (friendly desktop app) and Ollama (lightweight engine with a built-in API). They often run the same model files underneath. The difference is workflow, not magic model quality.

Step 1 — Pick your first tool

If this is your first time: start with LM Studio. Download the app, browse models visually, click download, start chatting. No terminal required.

When you outgrow clicking: install Ollama. It shines when other apps — Open WebUI, coding assistants, scripts — need a stable local API (application programming interface) on port 11434.
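
A quick way to confirm that the Ollama API is reachable is to ask it for its list of installed models. This is a minimal sketch that assumes Ollama is listening on its default address, http://localhost:11434. The function name ollama_is_up is made up for this guide.

```shell
# Returns success (exit code 0) if an Ollama server answers on the default port.
# Assumption: Ollama is listening on localhost:11434 (its default address).
ollama_is_up() {
  # -s: quiet, -f: treat HTTP errors as failures, --max-time: give up after 2 seconds
  curl -sf --max-time 2 http://localhost:11434/api/tags > /dev/null
}
```

Try it with: ollama_is_up && echo "Ollama is running" || echo "Ollama is not reachable"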

Many experienced users keep both: LM Studio to explore and compare models, Ollama for daily use and integrations. That is a normal setup, not a failure to pick a side.

For a full comparison of the two apps, read “Ollama vs LM Studio: Which One Should You Use?”

Step 2 — Read the model name (without panic)

Model pages list many variants of the same brain. The name is telling you two things: which model family it is, and how compressed the file is.

GGUF (GPT-Generated Unified Format) — the default format

GGUF is the standard packaged format for running models with llama.cpp, the engine that Ollama and LM Studio often use underneath. When you see GGUF on a download page, think: “works in both apps, good default.”
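
If you are unsure whether a downloaded file really is GGUF, you can look at its first four bytes, which spell out the letters GGUF. The helper below is a small sketch for this guide. The name is_gguf is hypothetical, and the check only looks at the file header, not at whether the rest of the file is intact.

```shell
# is_gguf FILE
# Succeeds if the file starts with the four ASCII bytes "GGUF".
is_gguf() {
  local magic
  # Read the first 4 bytes; drop NUL bytes so the shell does not warn on binary files.
  magic=$(head -c 4 "$1" 2>/dev/null | tr -d '\0')
  [ "$magic" = "GGUF" ]
}
```

Try it with: is_gguf path/to/model.gguf && echo "looks like GGUF" (the path here is a placeholder for your own file).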

MLX (Apple machine learning framework) — Apple-optimized (optional)

MLX is Apple’s open-source machine learning framework for Apple Silicon. MLX builds are tuned for M-series Macs. They can be very fast in LM Studio, but the ecosystem around Ollama and most third-party tools expects GGUF. Beginners should start with GGUF unless they are experimenting inside LM Studio only.

Quantization — what Q4, Q8, and bf16 mean

Q4_K_M (often “Q4”, 4-bit quantization) — smaller file, less RAM, slightly lower quality. Best default for most people.
Q8_0 (8-bit quantization) — bigger file, better quality, slower. Use when you have spare memory and care about nuance.
bf16 (16-bit brain floating point) — largest, slowest, research-grade. Skip until you know you need it.
MoE (mixture of experts) tags like A3B mean the model is large on paper but only part of it (about 3B parameters, in the case of A3B) activates per token. It generates text about as fast as a much smaller model, but the whole file still has to fit in memory.

You do not need every variant. Pick one Q4 file per job and move on.
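
To see why the quantization level matters so much, you can estimate file size as parameters times bits per weight, divided by 8. The sketch below does that arithmetic. The bits-per-weight figures (about 4.85 for Q4_K_M, about 8.5 for Q8_0) are approximations assumed for illustration, and estimate_gb is a name invented for this guide.

```shell
# estimate_gb PARAMS_BILLION BITS_PER_WEIGHT
# Prints the approximate file size in GB: parameters x bits / 8.
estimate_gb() {
  LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Bits-per-weight values are rough assumptions, not exact figures.
estimate_gb 27 4.85   # 27B model at Q4_K_M -> 16.4
estimate_gb 27 8.5    # same model at Q8_0  -> 28.7
estimate_gb 27 16     # same model at bf16  -> 54.0
```

Real downloads will differ by a gigabyte or two, since some parts of a model are often stored at higher precision. For MoE models, use the total parameter count (35 for a 35B-A3B build), not the active count.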

Step 3 — Will it run on my Mac?

On Apple Silicon, CPU and GPU (graphics processing unit) share the same memory pool (“unified memory”). LM Studio shows a green “Full GPU Offload Possible” badge when the model plus context buffer likely fits. That badge is a helpful estimate, not a guarantee.

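Simple rule: the model file plus a few gigabytes for context and the operating system has to fit in memory. On a 64 GB Mac, quantized models up to roughly 30–35 GB on disk usually run comfortably; on a 32 GB Mac, stay closer to 20 GB. bf16 builds and huge context windows (128K+) eat memory fast.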
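
Context eats memory because the runtime caches keys and values for every token in every layer. A rough estimate is 2 x layers x KV heads x head size x context length x bytes per value. The sketch below applies that formula to a hypothetical architecture (48 layers, 8 KV heads, head size 128, 16-bit cache); those numbers are placeholders, not the specification of any model named in this guide.

```shell
# kv_cache_gib LAYERS KV_HEADS HEAD_DIM CONTEXT_TOKENS BYTES_PER_VALUE
# Keys and values are both cached, hence the factor of 2.
kv_cache_gib() {
  LC_ALL=C awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%.1f\n", 2 * l * h * d * c * b / (1024 ^ 3) }'
}

# Hypothetical architecture: 48 layers, 8 KV heads, head size 128, 2 bytes per value.
kv_cache_gib 48 8 128 8192 2     # 8K context   -> 1.5
kv_cache_gib 48 8 128 131072 2   # 128K context -> 24.0
```

Real models vary, and some runtimes can store the cache at lower precision, so treat the result as an order-of-magnitude guide.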

Check your RAM: Apple menu → About This Mac, or run system_profiler SPHardwareDataType in Terminal.
Prefer Q4 quantizations first — they are the beginner sweet spot.
After loading a model, open Activity Monitor and choose Window → GPU History. If the graph shows GPU activity while Ollama or LM Studio is generating and the machine stays responsive, you are in good shape.
If the fan screams and the system stutters, try a smaller model or a more aggressive quantization (Q4 instead of Q8).
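
The checks above can be roughed out in a few lines of shell. This sketch reads total memory on macOS or Linux and compares a model's file size against it. The 70% threshold is an assumption chosen for this guide, not an official limit, and both function names are hypothetical.

```shell
# Prints total memory in whole GB on macOS or Linux.
total_ram_gb() {
  if [ "$(uname)" = "Darwin" ]; then
    # hw.memsize reports bytes
    echo $(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
  else
    # /proc/meminfo reports MemTotal in kB
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
  fi
}

# fits_comfortably MODEL_GB RAM_GB
# Assumption for this guide: keep the model file under about 70% of total memory.
fits_comfortably() {
  awk -v m="$1" -v r="$2" 'BEGIN { exit !(m <= 0.70 * r) }'
}
```

Try it with: fits_comfortably 21 "$(total_ram_gb)" && echo "should fit" || echo "try a smaller model"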

On Intel Macs or PCs with a discrete NVIDIA GPU, the same logic applies — fit the model in VRAM (video memory) or RAM, start with Q4, scale up only when needed.

Step 4 — Install three models, not thirty

Listing pages show dozens of Qwen, Gemma, and Mistral builds. That is a catalog, not a shopping list. Here is a practical starter set if you have a modern Mac with 32 GB RAM or more (adjust down for 16 GB — stick to 7B–14B class models):

# Daily driver — strong balance of quality and size (MoE)
ollama pull qwen3.6:35b-a3b-q4_K_M

# Coding and structured tasks — stable dense model
ollama pull qwen3.6:27b-q4_K_M

# Fast chat and quick questions
ollama pull qwen3.6:27b-mtp-q4_K_M

Use the 35B-A3B Q4 model as your default “brain” for general work.
Switch to 27B Q4 when you want predictable coding answers.
Use the MTP (multi-token prediction) variant when speed matters more than depth.
Skip bf16 and MLX variants until you know why you need them.
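
If you rerun your setup later, you may not want to pull models you already have. The helper below is a sketch that checks the output of ollama list before calling ollama pull. It assumes the first column of that output is the model name with its tag, and pull_if_missing is a name invented for this guide.

```shell
# pull_if_missing MODEL_TAG
# Pulls a model only when `ollama list` does not already show it.
pull_if_missing() {
  # Skip the header row, keep the first column, look for an exact match.
  if ollama list | awk 'NR > 1 { print $1 }' | grep -qxF "$1"; then
    echo "already installed: $1"
  else
    ollama pull "$1"
  fi
}

# Usage, with two models from the starter set above:
#   for tag in qwen3.6:35b-a3b-q4_K_M qwen3.6:27b-q4_K_M; do pull_if_missing "$tag"; done
```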

In LM Studio, search for the same model families and filter by Q4_K_M. Download one Gemma or Qwen build, chat for an evening, then decide. Do not hoard fifteen checkpoints.

Why many developers prefer Ollama

Community threads on r/LocalLLaMA repeat the same themes. Preference for Ollama is usually about workflow, not about LM Studio running “worse” models:

Open, script-friendly design — pull models from the terminal, run in Docker, wire into CI/CD.
REST (Representational State Transfer) API on by default — Open WebUI, Continue, Cline, and custom apps connect without extra setup.
Automatic load/unload — saves memory when multiple apps share one backend.
Runs headless on Linux servers — LM Studio is built around a GUI.
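
To make the API point concrete, here is roughly what a script does when it talks to Ollama: it posts a small JSON body to the /api/generate endpoint on the default port. This is a minimal sketch, assuming the server runs at localhost:11434. The helper names are hypothetical, and the escaping is deliberately simple.

```shell
# Escape backslashes and double quotes so the text is safe inside a JSON string.
# Limitation: newlines and other control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_payload MODEL PROMPT -> JSON body for a non-streaming request
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$(json_escape "$2")"
}

# ask_ollama MODEL PROMPT -> raw JSON reply from the local server
ask_ollama() {
  curl -s http://localhost:11434/api/generate -d "$(build_payload "$1" "$2")"
}
```

Try it with: ask_ollama qwen3.6:27b-q4_K_M "Explain quantization in one sentence." The reply is raw JSON; the generated text sits in its response field.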

Raw inference speed is often similar between the two apps because both lean on llama.cpp. Differences show up in RAM overhead, API design, and how many clients can hit the model at once.

Why LM Studio still earns a place on your disk

Visual model browser tied to Hugging Face — see sizes and ratings before downloading.
Sliders for temperature and context while you chat — great for learning.
Lowest-friction first conversation — no commands to memorize.
Fast sandbox for comparing quantizations before you commit in Ollama.

Choose your path

First evening with local AI → LM Studio.
Connecting a chat UI or coding tool → Ollama (+ Open WebUI or your editor plugin).
Exploring whether Q4 or Q8 is worth the disk space → LM Studio, then pull the winner with Ollama.
Automation, Docker, homelab server → Ollama.
Not sure yet → install both; use LM Studio to browse, Ollama to keep.

Bottom line

Local LLMs reward a simple setup: one friendly app to explore, one engine to integrate, two or three Q4 models instead of a warehouse of checkpoints. Learn the naming once, pick tools for your actual workflow, and upgrade models only when you hit a real limit — not because the catalog added another badge.

Sources & further reading

Community context and comparisons referenced in this guide (reviewed June 2026):

Reddit r/LocalLLaMA — Why do people like Ollama more than LM Studio?
Zen van Riel — Ollama vs LM Studio comparison (architecture, API, memory)
On this site — Ollama vs LM Studio: full comparison with benchmarks