How to run Qwen 3.5 locally

(unsloth.ai)

179 points | by Curiositry 11 hours ago

13 comments

moqizhengz 4 hours ago
Running 3.5 9B on my ASUS 5070ti 16G with lm studio gives a stable ~100 tok/s. This outperforms the majority of online llm services and the actual quality of output matches the benchmark. This model is really something, first time ever having usable model on consumer-grade hardware.
[-]
- smokel 1 hour ago
  > This outperforms the majority of online llm services
  I assume you mean outperforms in speed on the same model, not in usability compared to other more capable models.
  (For those who are getting their hopes up on using local LLMs to be any replacement for Sonnet or Opus.)
  [-]
  - moffkalast 39 minutes ago
    Obviously it's not going to be of a paid tier 2T sized SOTA model quality, but it can probably roughly match Haiku at the very least. And for tasks that aren't super complex that's already enough.
    Personally though, I find Qwen useless for anything but coding tasks because if its insufferable sycophancy. It's like 4o dialed up to 20, every reply starts with "You are absolutely right" with zero self awareness. And for coding, only the best model available is usually sensible to use otherwise it's just wasted time.
    [-]
    - Anduia 30 minutes ago
      That's why I start any prompt to Qwen 3.5 with:
      persona: brief rude senior
- throwdbaaway 3 hours ago
  There are Qwen3.5 27B quants in the range of 4 bits per weight, which fits into 16G of VRAM. The quality is comparable to Sonnet 4.0 from summer 2025. Inference speed is very good with ik_llama.cpp, and still decent with mainline llama.cpp.
  [-]
  - codemog 2 hours ago
    Can someone explain how a 27B model (quantized no less) ever be comparable to a model like Sonnet 4.0 which is likely in the mid to high hundreds of billions of parameters?
    Is it really just more training data? I doubt it’s architecture improvements, or at the very least, I imagine any architecture improvements are marginal.
    [-]
    - girvo 1 hour ago
      Considering the full fat Qwen3.5-plus is good, but barely Sonnet 4 good in my testing (but incredibly cheap!) I doubt the quantised versions are somehow as good if not better in practice.
      [-]
      - rustyhancock 11 minutes ago
        I think it depends on work pattern.
        Many do not give Sonnet or even Opus full reign where it really pushes ahead of over models.
        If you're asking for tightly constrained single functions at a time it really doesn't make a huge difference.
        I.e. the more vibe you do the better you need the model especially over long running and large contexts. Claude is heading and shoulders above everyone else in that setting.
      - stavros 36 minutes ago
        When you say Sonnet 4, do you mean literally 4, or 4.6?
    - otabdeveloper4 2 hours ago
      There's diminishing returns bigly when you increase parameter count.
      The sweet spot isn't in the "hundreds of billions" range, it's much lower than that.
      Anyways your perception of a model's "quality" is determined by careful post-training.
      [-]
      - codemog 1 hour ago
        Interesting. I see papers where researchers will finetune models in the 7 to 12b range and even beat or be competitive with frontier models. I wish I knew how this was possible, or had more intuition on such things. If anyone has paper recommendations, I’d appreciate it.
        [-]
        stavros 35 minutes ago
        They're using a revolutionary new method called "training on the test set".
      - zozbot234 2 hours ago
        More parameters improves general knowledge a lot, but you have to quantize more in order to fit in a given amount of memory, which if taken to extremes leads to erratic behavior. For casual chat use even Q2 models can be compelling, agentic use requires more regularization thus less quantized parameters and lowering the total amount to compensate.
    - revolvingthrow 1 hour ago
      It doesn’t. I’m not sure it outperforms chatgpt 3
      [-]
      - BoredomIsFun 21 minutes ago
        You are not being serrious, are you? even 1.5 years old Mistral and Meta models outperform ChatGPT 3.
      - gunalx 28 minutes ago
        3 not 3.5? I think I would even prefer the qwen3.5 0.8b over GPT 3.
    - spwa4 1 hour ago
      The short answer is that there are more things that matter than parameter count, and we are probably nowhere near the most efficient way to make these models. Also: the big AI labs have shown a few times that internally they have way more capable models
  - zozbot234 2 hours ago
    With MoE models, if the complete weights for inactive experts almost fit in RAM you can set up mmap use and they will be streamed from disk when needed. There's obviously a slowdown but it is quite gradual, and even less relevant if you use fast storage.
  - teaearlgraycold 3 hours ago
    Qwen3.5 35B A3B is much much faster and fits if you get a 3 bit version. How fast are you getting 27B to run?
    On my M3 Air w/ 24GB of memory 27B is 2 tok/s but 35B A3B is 14-22 tok/s which is actually usable.
    [-]
    - throwdbaaway 1 hour ago
      Using ik_llama.cpp to run a 27B 4bpw quant on a RTX 3090, I get 1312 tok/s PP and 40.7 tok/s TG at zero context, dropping to 1009 tok/s PP and 36.2 tok/s TG at 40960 context.
      35B A3B is faster but didn't do too well in my limited testing.
    - ece 2 hours ago
      The 27B is rated slightly higher for SWE-bench.
- yangikan 4 hours ago
  Do you point claude code to this? The orchestration seems to be very important.
  [-]
  - tommyjepsen 0 minutes ago
    I ran the Qwen3 Coder 30B through LM Studio and with OpenCode(Instead of Claude code). Did decent on M4 Max 32GB. https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
  - badgersnake 1 hour ago
    I’ve tried it on Claude code, Found it to be fairly crap. It got stuck in a loop doing the wrong thing and would not be talked out of it. I’ve found this bug that would stop it compiling right after compiling it, that sort of thing.
    Also seemed to ignore fairly simple instructions in CLAUDE.md about building and running tests.
  - teaearlgraycold 2 hours ago
    I loaded Qwen into LM Studio and then ran Oh My Pi. It automatically picked up the LM Studio API server. For some reason the 35B A3B model had issues with Oh My Pi's ability to pass a thinking parameter which caused it to crash. 27B did not have that issue for me but it's much slower.
    Here's how I got the 35B model to work: https://gist.github.com/danthedaniel/c1542c65469fb1caafabe13...
    The 35B model is still pretty slow on my machine but it's cool to see it working.
- lukan 3 hours ago
  What exact model are you using?
  I have a 16GB GPU as well, but have never run a local model so far. According to the table in the article, 9B and 8-bit -> 13 GB and 27B and 3-bit seem to fit inside the memory. Or is there more space required for context etc?
  [-]
  - vasquez 2 hours ago
    It depends on the task, but you generally want some context. These models can do things like OCR and summarize a pdf for you, which takes a bit of working memory. Even more so for coding CLIs like opencode-ai, qwen code and mistral ai.
    Inference engines like llama.cpp will offload model and context to system ram for you, at the cost of performance. A MoE like 35B-A3B might serve you better than the ones mentioned, even if it doesn't fit entirely on the GPU. I suggest testing all three. Perhaps even 122-A10B if you have plenty of system ram.
    Q4 is a common baseline for simple tasks on local models. I like to step up to Q5/Q6 for anything involving tool use on the smallish models I can run (9B and 35B-A3B).
    Larger models tolerate lower quants better than small ones, 27B might be usable at 3 bpw where 9B or 4B wouldn't. You can also quantize the context. On llama.cpp you'd set the flags -fa on, -ctk x and ctv y. -h to see valid parameters. K is more sensitive to quantization than V, don't bother lowering it past q8_0. KV quantization is allegedly broken for Qwen 3.5 right now, but I can't tell.
mingodad 1 hour ago
I'm still a bit confused because it says "All uploads use Unsloth Dynamic 2.0" but then when looking at the available options like for 4 bits there is:
IQ4_XS 5.17 GB, Q4_K_S 5.39 GB, IQ4_NL 5.37 GB, Q4_0 5.38 GB, Q4_1 5.84 GB, Q4_K_M 5.68 GB, UD-Q4_K_XL 5.97 GB
And no explanation for what they are and what tradeoffs they have, but in the turorial it explicitly used Q4_K_XL with llama.cpp .
I'm using a macmini m4 16GB and so far my prefered model is Qwen3-4B-Instruct-2507-Q4_K_M although a bit chat but my test with Qwen3.5-4B-UD-Q4_K_XL shows it's a lot more chat, I'm basically using it in chat mode for basic man style questions.
I understand that each user has it's own specific needs but would be nice to have a place that have a list of typical models/hardware listed with it's common config parameters and memory usage.
Even on redit specific channels it's a bit of nightmare of loot of talk but no concrete config/usage clear examples.
I'm floowing this topic heavilly for the last 3 months and I see more confusion than clarification.
Right now I'm getting good cost/benefit results with the qwen cli with coder-model in the cloud and watching constantly to see when a local model on affordable hardware with enviroment firendly energy comsumption arrives.
[-]
- PhilippGille 19 minutes ago
  > would be nice to have a place that have a list of typical models/ hardware listed with it's common config parameters and memory usage
  https://www.localscore.ai from Mozilla Buolders was supposed to be this, but there are not enough users I guess, I didn't find any Qwen 3.5 entries yet
- ay 47 minutes ago
  I tried qwen3.5:4b in ollama on my 4 year old Mac M1 with my own coding harness and it exhibited pretty decent tool calling, but it is a bit slow and seemed a little confused with the more complex tasks (also, I have it code rust, that might add complexity). The task was “find the debug that does X and make it conditional based on the whichever variable is controlled by the CLI ‘/debug foo’” - I didn’t do much with it after that.
  It may be interesting to try a 6bit quant of qwen3.5-35b-a3b - I had pretty good results with it running it on a single 4090 - for obvious reasons I didn’t try it on the old mac.
  I am using 8bit quant of qwen3.5-27b as more or less the main engine for the past ~week and am quite happy with it - but that requires more memory/gpu power.
  HTH.
antirez 1 hour ago
My private benchmarks, using DeepSeek replies to coding problems as a baseline, with Claude Opus as judge. However when reading this percentages consider that the no-think setup is much faster, and may be more practical for most situations.
```
    1   │ DeepSeek API -- 100%
    2   │ qwen3.5:35b-a3b-q8_0 (thinking) -- 92.5%
    3   │ qwen3.5:35b-a3b-q4_K_M (thinking) -- 90.0%
    4   │ qwen3.5:35b-a3b-q8_0 (no-think) -- 81.3%
    5   │ qwen3.5:27b-q8_0 (thinking) -- 75.3%
```
I expected the 27B dense model to score higher. Disclaimer: those numbers are from one-shot replies evaluations, the model was not put in a context where it could reiterate as an agent.
RandomGerm4n 41 minutes ago
9b with 4bits runs with around 60 tok/s on my RTX 4070 with 12GB VRAM and 35b-A3B runs with around 14 tok/s and partial offloading. For roleplaying I prefer the faster 9b Version but for coding tasks both aren't really usable and Claude is still way better especially if you manage to persuade your employer to give you unlimited access.
Curiositry 4 hours ago
Qwen3.5 9b seems to be fairly competent at OCR and text formatting cleanup running in llama.cpp on CPU, albeit slow. However, I have compiled it umpteen ways and still haven't gotten GPU offloading working properly (which I had with Ollama), on an old 1650 Ti with 4GB VRAM (it tries to allocate too much memory).
[-]
- acters 4 hours ago
  I have a 1660ti and the cachyos + aur/llama.cpp-cuda package is working fine for me. With about 5.3 GB of usable memory, I find that the 35B model is by far the most capable one that performs just as fast as the 4B model that fits entirely on my GPU. I did try the 9B model and was surprisingly capable. However 35B still better in some of my own anecdotal test cases. Very happy with the improvement. However, I notice that qwen 3.5 is about half the speed of qwen 3
- WhyNotHugo 3 hours ago
  If you’re building from source, the vulkan backend is the easiest to build and use for GPU offloading.
  [-]
  - Curiositry 3 hours ago
    Yes, that's what I tried first. Same issue with trying to allocate more memory than was available.
_qua 46 minutes ago
For roughly equivalent memory sizes, how does one choose between the bit depth and the model size?
[-]
- moffkalast 29 minutes ago
  As a rule of thumb the larger the model is, the more you can quantize it without losing performance, but smaller models will run faster. It usually always makes sense to pick the larger model at a lower quant, as long as the speed is acceptable. Smaller models also use a smaller KV cache, so longer contexts are more viable. It really depends on what your use case is.
  Imo though, going below 4 bits for anything that's less than 70B is not worth the degradation. BF/FP16 and Q8 are usually indistinguishable except for vision encoders (mmproj) and for really small models, like under 2B.
brainless 37 minutes ago
Local models, particularly the new ones would be really useful in many situations. They are not for general chat but if tools use them in specific agents, the results are awesome.
I built https://github.com/brainless/dwata to submit for Google Gemini Hackathon, and focused on an agent that would replace email content with regex to extract financial data. I used Gemini 3 Flash.
After submitting to the contest, I kept working on branch: reverse-template-based-financial-data-extraction to use Ministral 3:3b. I moved away from regex detection to a reverse template generation. Like Jinja2 syntax but in reverse, from the source email.
Financial data extraction now works OK ish and I am constantly improving this to aim for a launch soon. I will try with Qwen 3.5 Small, maybe 4b model. Both Ministral 3:3b and Qwen 3.5 Small:4b will fit on the smallest Mac Mini M4 or a RTX 3060 6GB (I have these devices). dwata should be able to process all sorts of financial data, transaction and meta-data (vendor, reference #), at a pretty nice speed. Keep it running a couple hours and you can go through 20K or 30K emails. All local!
Twirrim 7 hours ago
I've been finding it very practical to run the 35B-A3B model on an 8GB RTX 3050, it's pretty responsive and doing a good job of the coding tasks I've thrown at it. I need to grab the freshly updated models, the older one seems to occasionally get stuck in a loop with tool use, which they suggest they've fixed.
[-]
- fy20 4 hours ago
  I guess you are doing offloading to system RAM? What tokens per second do you get? I've got an old gaming laptop with a RTX 3060, sounds like it could work well as a local inference server.
  [-]
  - manmal 3 hours ago
    In the article, they claim up to 25t/s for the LARGEST model with a 24GB VRAM card. Need a lot of RAM obviously
- ufish235 6 hours ago
  Can you give an example of some coding tasks? I had no idea local was that good.
  [-]
  - hooch 4 hours ago
    Changed into a directory recently and fired up the qwen code CLI and gave it two prompts: "so what's this then?" - to which it had a good summary across stack and product, and then "think you can find something todo in the TODO?" - and while I was busy in Claude Code on another project, it neatly finished three HTML & CSS tasks - that I had been procrastinating on for weeks.
    This was a qwen3-coder-next 35B model on M4 Max with 64GB which seems to be 51GB size according to ollama. Have not yet tried the variants from the TFA.
    [-]
    - manmal 3 hours ago
      3.5 seems to be better at coding than 3-coder-next, I’d check it out.
- fragmede 6 hours ago
  Which models would that be?
vvram 2 hours ago
What would be optimal HW configurations/systems recommended?
[-]
- speedgoose 1 hour ago
  It depends. Gaming PCs are fine for small models. Apple hardware can run much bigger models without having to open a window to cool down the room. If money isn’t an issue, NVIDIA isn’t that overpriced for no reasons and a server full of NVIDIA AI GPUs is neat.
sieste 2 hours ago
> you can use 'true' and 'false' interchangeably.
made me laugh, especially in the context of LLMs.
b89kim 1 hour ago
I’ve been benchmarking GGUF quants for Python tasks under some hardware configs.
```
  - 4090 : 27b-q4_k_m
  - A100: 27b-q6_k
  - 3*A100: 122b-a10b-q6_k_L
```
Using the Qwen team's "thinking" presets, I found that non-agentic coding performance doesn't feel significant leap over unquantized GPT-OSS-120B. It shows some hallucination and repetition for mujoco codes with default presence penalty. 27b-q4_k_m with 4090 generates 30~35 tok/s in good quality.
KronisLV 1 hour ago
I had an annoying issue in a setup with two Nvidia L4 cards where trying to run the MoE versions to get decent performance just didn't work with Ollama, seems the same as these:
https://github.com/ollama/ollama/issues/14419
https://github.com/ollama/ollama/issues/14503
So for now I'm back to Qwen 3 30B A3B, kind of a bummer, because the previous model is pretty fast but kinda dumb, even for simple tasks like on-prem code review!
krasikra 1 hour ago
[dead]