A Coding Benchmark: Step 3.7 vs Qwen 3.5 122B-A10B vs Qwen 3.6 27B vs Qwen 3.6 35B-A3B
Introduction
I recently ran a simple real-world benchmark with 4 different models, all quants from Unsloth:
- Step 3.7 - IQ4_XS
- Qwen 3.5 122B-A10B - UD-Q4_K_XL
- Qwen 3.6 27B - Q5_K_M
- Qwen 3.6 35B-A3B - UD-Q6_K
The benchmark is quite simple: I have my own coding harness written in Go, and I want to add a /config command that displays the runtime value of every configuration field, along with where that value came from (environment variable, configuration file, or default value).
To implement this, I first asked Step 3.7 to write a plan. Here is the plan it wrote, for reference:
# PLAN-config.md
## /config Command
Display the effective configuration from inside the TUI so the user can inspect model, endpoint, context limits, and other active settings without quitting.
## Changes
### `internal/app/app.go`
1. **Add `case "config":` in `handleCommand` (line ~809)**
Parse a `show` subcommand. Default to `show` when no subcommand is given.
```go
case "config":
    sub := "show"
    if len(parts) > 1 {
        sub = parts[1]
    }
    switch sub {
    case "show":
        return m.handleConfigShow(), true
    default:
        msg := types.NewSystemMessage("Usage: /config [show]")
        m.Messages = append(m.Messages, msg)
        m.Screen.GetMessageView().MessageGrew()
        return nil, true
    }
```
2. **Add `handleConfigShow` method**
Format key config values as a readable message. Mask the bearer token (show only last 4 chars or `***`). Indicate the source if it differs from the default (e.g. `[env]`, `[yaml]`, `[cli]`) — the config loader already knows the origin, so expose that via the `Config` struct or a separate `source` field.
Fields to display:
| Field | Notes |
|---|---|
| `Model` | always shown |
| `Endpoint` | always shown |
| `BearerToken` | masked — last 4 chars only |
| `CWD` | working directory |
| `MaxContextTokens` | context window size |
| `CompactionThreshold` | e.g. `0.8` |
| `KeepRecentTokens` | preserved after compaction |
| `MaxTokens` | response token limit |
| `SystemPrompt` | show length + first 80 chars, not the full text |
| `Debug` | on/off |
| `UILogPath` | log file location |
| `EnableMultiline` | on/off |
| `EnableAutocomplete` | on/off |
| `AutocompleteMaxItems` | max suggestions |
| `TabCompletes` | on/off |
3. **Update help text (line ~841)**
Add:
```
helpText += " /config - Show effective configuration\n"
```
### `internal/app/config_types.go`
4. **Add `Source` field to `Config`**
Track where each value came from (default, yaml, env, cli) so `/config show` can display it. Add a `source` map or per-field tags:
```go
type ConfigSource string

const (
    SourceDefault ConfigSource = "default"
    SourceYAML    ConfigSource = "yaml"
    SourceEnv     ConfigSource = "env"
    SourceCLI     ConfigSource = "cli"
)
```
Either embed a `map[string]ConfigSource` in `Config` or add a `GetSource(key string) ConfigSource` method. Keep it minimal — only needed for display.
### `internal/app/config.go`
5. **Populate source tracking in `LoadConfig` / `ApplyFlags`**
When a value is overridden, record the source. The loader already has the precedence chain (yaml → env → cli), so this is a straightforward addition alongside the existing assignment logic.
## Notes
- Bearer token masking: show `...xxxx` or empty string if not set. Never print the full token.
- System prompt: only show length and a truncated preview to avoid flooding the message view.
- Source display is additive — a `[env]` or `[cli]` tag next to the value makes it easy to understand why a non-default value is active.
- Keep it read-only. No config mutation from `/config` for now (that's a future stretch).
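To make steps 2, 4 and 5 of the plan more concrete, here is a minimal sketch of how source tracking and token masking could fit together. It is not the code any of the models produced. The trimmed-down `Config` struct, the `loadConfig` signature, the `override` helper and the map-based layers are all assumptions made for illustration; only `ConfigSource`, its constants and `GetSource` come from the plan.

```go
package main

import "fmt"

// ConfigSource and its constants mirror the type proposed in the plan.
type ConfigSource string

const (
	SourceDefault ConfigSource = "default"
	SourceYAML    ConfigSource = "yaml"
	SourceEnv     ConfigSource = "env"
	SourceCLI     ConfigSource = "cli"
)

// Config is a hypothetical, trimmed-down stand-in for the real struct.
type Config struct {
	Model       string
	BearerToken string
	sources     map[string]ConfigSource
}

// override (hypothetical helper) writes a non-empty value into dst and
// records which layer it came from. Empty values leave dst untouched.
func (c *Config) override(key string, dst *string, value string, src ConfigSource) {
	if value == "" {
		return
	}
	*dst = value
	c.sources[key] = src
}

// GetSource returns the recorded origin of a field, or SourceDefault
// when no layer overrode it.
func (c *Config) GetSource(key string) ConfigSource {
	if src, ok := c.sources[key]; ok {
		return src
	}
	return SourceDefault
}

// maskToken never returns the full token: empty stays empty, very short
// tokens become "***", anything else shows only the last 4 characters.
func maskToken(token string) string {
	switch {
	case token == "":
		return ""
	case len(token) <= 4:
		return "***"
	default:
		return "..." + token[len(token)-4:]
	}
}

// loadConfig (hypothetical signature) applies the layers in the precedence
// order described in the plan: yaml, then env, then cli.
func loadConfig(yamlVals, envVals, cliVals map[string]string) *Config {
	c := &Config{Model: "default-model", sources: map[string]ConfigSource{}}
	layers := []struct {
		values map[string]string
		src    ConfigSource
	}{
		{yamlVals, SourceYAML},
		{envVals, SourceEnv},
		{cliVals, SourceCLI},
	}
	for _, l := range layers {
		// Reading a missing key (or a nil map) yields "", which override ignores.
		c.override("Model", &c.Model, l.values["Model"], l.src)
		c.override("BearerToken", &c.BearerToken, l.values["BearerToken"], l.src)
	}
	return c
}

func main() {
	c := loadConfig(
		map[string]string{"Model": "model-from-yaml"},
		map[string]string{"BearerToken": "sk-secret-1234"},
		nil,
	)
	fmt.Printf("Model: %s [%s]\n", c.Model, c.GetSource("Model"))
	fmt.Printf("BearerToken: %s [%s]\n", maskToken(c.BearerToken), c.GetSource("BearerToken"))
}
```

Recording the source in the same helper that performs the assignment keeps the two from drifting apart, which is roughly what the plan means by adding tracking "alongside the existing assignment logic".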
My first observation is that Step 3.7's prose reads noticeably differently from Qwen's. Not better or worse, just different, and I like it.
I then used llama.cpp on a Strix Halo to implement this plan, one model at a time.
Results and interpretation
| Criteria | Step 3.7 | Qwen 3.5 122B-A10B | Qwen 3.6 27B | Qwen 3.6 35B-A3B |
|---|---|---|---|---|
| Quant | IQ4_XS | UD-Q4_K_XL | Q5_K_M | UD-Q6_K |
| Token Counts | 69555 | 67814 | 68729 | 129280 |
| Duration (wall time) | 24m36s | 11m51s | 15m04s | 15m19s |
| Fully implemented? | Yes | Yes | Yes | Yes |
| MTP supported in llama.cpp? | No | Yes | Yes | Yes |
| Extra observation | N/A | Slightly better than the others at adapting unit tests (simpler) | Added /config to the command autocompletion list | N/A |
It is interesting to see that they all ended up generating a working implementation with extremely similar results, which means that comparing their wall times and token counts makes sense for this use case.
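One way to read the two numeric rows of the table together is to divide the token count by the wall time. The sketch below does that with the figures from the table. It assumes the token count is the total for the whole run, so the ratio is a rough "tokens processed per second of wall time" and not a pure generation speed; the `run` type and `tokensPerSecond` function are names made up for this example.

```go
package main

import (
	"fmt"
	"time"
)

// run holds one column of the results table (hypothetical type).
type run struct {
	model    string
	tokens   int
	wallTime string // same notation as the table, e.g. "24m36s"
}

// tokensPerSecond divides a token count by a wall time given as a Go
// duration string. It returns an error if the duration cannot be parsed.
func tokensPerSecond(tokens int, wallTime string) (float64, error) {
	d, err := time.ParseDuration(wallTime)
	if err != nil {
		return 0, err
	}
	return float64(tokens) / d.Seconds(), nil
}

func main() {
	// Figures copied from the results table.
	runs := []run{
		{"Step 3.7", 69555, "24m36s"},
		{"Qwen 3.5 122B-A10B", 67814, "11m51s"},
		{"Qwen 3.6 27B", 68729, "15m04s"},
		{"Qwen 3.6 35B-A3B", 129280, "15m19s"},
	}
	for _, r := range runs {
		tps, err := tokensPerSecond(r.tokens, r.wallTime)
		if err != nil {
			fmt.Println("bad duration for", r.model, ":", err)
			continue
		}
		fmt.Printf("%-20s %6.1f tokens/s\n", r.model, tps)
	}
}
```

With the table's numbers this works out to roughly 47, 95, 76 and 141 tokens per second respectively. Qwen 3.6 35B-A3B used almost twice as many tokens as the others, yet finished in about the same time as Qwen 3.6 27B.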
For this exact use case, I personally consider Qwen 3.5 122B-A10B the best for me:
- its wall time is the shortest,
- adding `/config` to the autocomplete was not part of the plan; even though it sounds like `Qwen 3.6 27B` did better by doing this, it is a deviation from the plan!
This is my verdict because I supervise the models' output and expect them to follow the plan fully when there is one. For someone purely vibe-coding everything and/or not looking at the results in depth, Qwen 3.6 27B could be a better choice, since it did slightly better by going the extra mile.
Some variance here and there could be a fluke, or could be a trait of the model or quant. In order to have something more reproducible in the future, I’ll develop a testbench with a set of E2E tests and a list of criteria that will be interpreted by another [big] LLM. Stay tuned!