
Why I Switched from LLaMA 7B to Mistral 7B (And What It Took)

By Joe Stasio on May 13, 2025

LLaMA 7B Got Me Started—Then Hit a Wall

When I first spun up my local FastAPI-based LLM stack, LLaMA 7B was the engine. It worked. Kind of. It could handle basic completions and even simple coding prompts, but anything that required structured understanding or nuanced tone fell apart. My comment endpoint was constantly backfilling with fallback junk like “This cracked me up: the twist wild.” Not good enough.

Broken Templates, Mid-Level IQ

At one point, the bot was pushing hardcoded comment stubs back into the site—junk like “Wait what?! this post sheesh.” Even when I tried to clean it up, LLaMA would echo my pre_prompt or hallucinate formatting. Some replies literally read: “You should write a comment as if reacting to the news...” Like bro, you ARE the comment. It had to go.

Mistral 7B: Drop-In, Punch Up

I swapped in Mistral 7B using the same pipeline. Same quantization layer, same FastAPI stack. But the difference was immediate: faster token generation, better grasp of pre_prompt structure, and actual human-like phrasing. No more template hallucination. No more fake intros.
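Here's roughly what the swap looks like with the Hugging Face Transformers loaders; the checkpoint name below is a stand-in for whichever Mistral 7B variant you pull, and everything downstream stays put:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Stand-in checkpoint name; any Mistral 7B variant loads the same way.
    MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # previously a LLaMA 7B path

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        device_map="auto",         # same device placement as the LLaMA setup
        # quantization_config=...  # the unchanged 4-bit config, shown further down
    )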

Running in WSL2 on Windows 11

I’m running all of this inside an Ubuntu LTS environment under WSL2 on Windows 11. Hardware-wise, it’s an Intel i5-10400F CPU, an NVIDIA RTX 2070 Super, and 32GB of RAM. Mistral loads without a fuss and responds to most prompts in about 8 seconds. It’s not instant, but it’s entirely local, and way better than any cloud latency drama.
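If you're setting this up yourself, it's worth confirming PyTorch can actually see the GPU through the CUDA-on-WSL driver before blaming the model for being slow. Something like:

    import torch

    # Sanity check that the RTX 2070 Super is visible inside WSL2.
    if torch.cuda.is_available():
        print("CUDA OK:", torch.cuda.get_device_name(0))
    else:
        print("No GPU visible; check the NVIDIA CUDA-on-WSL driver")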

Quantization + Offload

I used 4-bit NF4 quantization with float16 compute via BitsAndBytesConfig. This keeps memory pressure down and lets me offload the model to disk where needed. Even with GPU support under WSL, everything stays stable—no VRAM thrashing, no weird driver crashes.
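In code, that config looks roughly like this (the checkpoint name and offload folder path are just examples):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # stand-in checkpoint name

    # 4-bit NF4 weights with float16 compute, as described above.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",          # let accelerate place layers across GPU/CPU
        offload_folder="offload",   # example path; spills overflow weights to disk
    )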

Prompt Engineering Got Easier

I kept my build_prompt flow tight: SYSTEM prompt at the top, then pre_prompt for tone/direction, then the raw user input. But I removed garbage like “generate a realistic comment” from the prompt body—it was causing echo replies. Mistral let me be blunt. “Here’s the news. React.” And it did.
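The whole build_prompt helper boils down to concatenation in that order. A stripped-down sketch (the actual strings in my stack differ):

    def build_prompt(system_prompt: str, pre_prompt: str, user_input: str) -> str:
        """SYSTEM at the top, then pre_prompt for tone/direction, then the raw
        user input. No 'generate a realistic comment' meta-instructions."""
        return f"{system_prompt.strip()}\n\n{pre_prompt.strip()}\n\n{user_input.strip()}\n"

    # Example call with placeholder strings:
    prompt = build_prompt(
        system_prompt="You reply as a regular person reading the news.",
        pre_prompt="Here's the news. React.",
        user_input="<raw news story text goes here>",
    )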

Fixing the API Crash Loop

Before the switch, my FastAPI service was choking on threads. One bad comment string (or a fallback object that wasn’t a string) would throw an AttributeError inside engage_loop. Mistral’s reliability fixed half of that. The rest? I added checks to ensure nothing got posted unless it was a valid non-empty string. Cleaned it all up.
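The guard itself is tiny. Here's a sketch of the kind of check that now sits in front of every post (the surrounding engage_loop code is simplified away):

    def is_postable(comment) -> bool:
        """Reject fallback objects, non-strings, and empty or whitespace-only
        replies before anything gets posted back to the site."""
        return isinstance(comment, str) and bool(comment.strip())

    # Quick demo of what gets filtered out:
    for candidate in ["", None, {"fallback": True}, "Okay, that headline is wild."]:
        print(repr(candidate), "->", is_postable(candidate))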

Deploying to Pi 5 Later

Even though I mostly use it on my main rig, I’ve mirrored the deployment on a Raspberry Pi 5 booting from NVMe with no GUI. Mistral still runs great there in 4-bit mode, but the main dev environment remains on Windows for now. I keep the same Python stack—FastAPI, uvicorn, Transformers, and systemd-style autorun with local endpoints exposed over LAN.
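The entrypoint is identical on both machines. A minimal sketch of how the API gets exposed over the LAN (module name and port are just examples):

    import uvicorn

    from main import app  # the FastAPI app; module name is an example

    if __name__ == "__main__":
        # Bind to all interfaces so other machines on the LAN can reach the endpoints.
        # A systemd unit (Pi 5) or autorun script (WSL2) launches this on boot.
        uvicorn.run(app, host="0.0.0.0", port=8000)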

Endpoints That Actually Work

  • /generate — generic prompt/response with streaming params
  • /comment — AI reacts to news stories in emotional internet-speak
  • /guessing_game — strict yes/no replies for animal guessing prompts

I even cleaned up the Swagger docs and split the request models per route so unrelated fields stop leaking into endpoints like /comment. Now everything reflects reality. The API isn’t just functional—it’s clean.
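Splitting the request models per route is what keeps /comment from inheriting /generate's streaming knobs in the docs. A rough sketch of the shape (field names here are illustrative, not the exact schema):

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class GenerateRequest(BaseModel):
        prompt: str
        max_new_tokens: int = 128
        stream: bool = False        # streaming params live only on /generate

    class CommentRequest(BaseModel):
        headline: str               # illustrative fields; /comment no longer sees
        story: str                  # unrelated generation knobs in Swagger

    @app.post("/generate")
    def generate(req: GenerateRequest):
        return {"text": "..."}      # model call omitted

    @app.post("/comment")
    def comment(req: CommentRequest):
        return {"comment": "..."}   # model call omitted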

Stack Snapshot

  • Windows 11 host with Ubuntu LTS WSL2
  • Intel i5-10400F + RTX 2070 Super + 32GB RAM
  • FastAPI backend with Swagger + schema validation
  • Mistral 7B quantized 4-bit via BitsAndBytes
  • Piper TTS + MPV for optional local audio
  • Local CLI tools, Postman scripts, and JS frontend hooks

The Result: Fast, Clean, Mine

No more fallback noise. No template pollution. No cloud dependency. Mistral 7B handles real prompts, responds with nuance, and runs stable across threads inside a local WSL2 setup. I can test it via curl or fire it off from any frontend. It’s fast, it’s clean, and it’s under my control now. Exactly how I like it.