Which models are currently good for coding tasks? I just ran Qwen3-14B-Q6_K.gguf with llama.cpp on my card with 16 GB of VRAM (+32 GB DDR4), but I come really close to filling the entire VRAM in a single short conversation, so I am looking for some (smaller) alternatives to test.
I might throw an OpenCode container into the mix next, if that is relevant information.
podman run --rm --replace --pull=newer \
--name llama \
-p 8080:8080 \
-v ./llama_models:/models:Z \
--device /dev/dri/card1:/dev/dri/card1 \
--device /dev/dri/renderD128:/dev/dri/renderD128 \
ghcr.io/ggml-org/llama.cpp:full-vulkan \
--server \
-m /models/Qwen3-14B-Q6_K.gguf \
-ngl 99 \
-fa on \
-c 16384 \
--temp 0.6 \
--top-k 20 \
--top-p 0.95 \
--jinja \
--host 0.0.0.0 --port 8080
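For anyone wondering why VRAM fills up so quickly with the command above: besides the weights (on the order of 12 GB for a Q6_K quant of a 14B model, as a rough estimate), llama.cpp also keeps a KV cache sized for the whole `-c 16384` context. Here is a back-of-the-envelope sketch of that cache size. The architecture numbers (40 layers, 8 KV heads, head dimension 128) and the f16 cache type are my assumptions for Qwen3-14B, not something stated in the post, so check them against the GGUF metadata llama.cpp prints at load time. `kv_cache_mib` is just a helper name made up for this example.

```shell
#!/usr/bin/env bash
# Rough KV-cache size estimate for a llama.cpp context window.
# ASSUMPTIONS (not from the post): Qwen3-14B uses 40 layers, 8 KV heads and a
# head dimension of 128, and the cache is stored as f16 (2 bytes per value).

# kv_cache_mib N_CTX N_LAYERS N_KV_HEADS HEAD_DIM BYTES_PER_VALUE
# Prints the estimated cache size in MiB (integer division, rounds down).
kv_cache_mib() {
  local n_ctx=$1 n_layers=$2 n_kv_heads=$3 head_dim=$4 bytes_per_value=$5
  # Factor 2: one key vector and one value vector per layer per token.
  local per_token=$(( 2 * n_layers * n_kv_heads * head_dim * bytes_per_value ))
  echo $(( n_ctx * per_token / 1024 / 1024 ))
}

# Compare the context size from the post with two smaller ones.
for ctx in 16384 8192 4096; do
  echo "ctx=${ctx}: ~$(kv_cache_mib "$ctx" 40 8 128 2) MiB of KV cache"
done
```

Under those assumptions the cache is about 2.5 GiB at `-c 16384`, so lowering `-c` is one option, and quantizing the cache with llama.cpp's `--cache-type-k` / `--cache-type-v` options (for example `q8_0`) is another that may be worth testing before switching to a smaller model.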


I run gemma4:26b in 16 GB of RAM. It’s slow on my test rig with only 2 GB VRAM but it should fit in 16 GB VRAM fine. I have one of those AMD BC-250 crypto mining units set up as a gaming rig, but my plan was to also run ollama on it. gemma4:26b was the model I planned to make the default. I haven’t messed with it yet since I’m playing through my Steam catalog that was waiting for me to have a PC that could run them lol.