Rémy Mathieu


A Coding Benchmark: Step 3.7 vs Qwen 3.5 122B-A10B vs Qwen 3.6 27B vs Qwen 3.6 35B-A3B

Jun 1, 2026 · 5 minutes read

Introduction

I recently ran a simple real-world benchmark with 4 different models, all quants from Unsloth:

  • Step 3.7 - IQ4_XS
  • Qwen 3.5 122B-A10B - UD-Q4_K_XL
  • Qwen 3.6 27B - Q5_K_M
  • Qwen 3.6 35B-A3B - UD-Q6_K

The benchmark is quite simple: I have my own coding harness written in Go, and I want to add a /config command that displays the runtime value of every configuration field, along with where that value came from (environment variable, configuration file, or default value).

To implement this, I first asked Step 3.7 to write a plan. Here is the plan it wrote, for reference:

# PLAN-config.md

## /config Command

Display the effective configuration from inside the TUI so the user can inspect model, endpoint, context limits, and other active settings without quitting.

## Changes

### `internal/app/app.go`

1. **Add `case "config":` in `handleCommand` (line ~809)**

   Parse a `show` subcommand. Default to `show` when no subcommand is given.

   ```go
   case "config":
       sub := "show"
       if len(parts) > 1 {
           sub = parts[1]
       }
       switch sub {
       case "show":
           return m.handleConfigShow(), true
       default:
           msg := types.NewSystemMessage("Usage: /config [show]")
           m.Messages = append(m.Messages, msg)
           m.Screen.GetMessageView().MessageGrew()
           return nil, true
       }
   ```

2. **Add `handleConfigShow` method**

   Format key config values as a readable message. Mask the bearer token (show only last 4 chars or `***`). Indicate the source if it differs from the default (e.g. `[env]`, `[yaml]`, `[cli]`) — the config loader already knows the origin, so expose that via the `Config` struct or a separate `source` field.

   Fields to display:

   | Field | Notes |
   |---|---|
   | `Model` | always shown |
   | `Endpoint` | always shown |
   | `BearerToken` | masked — last 4 chars only |
   | `CWD` | working directory |
   | `MaxContextTokens` | context window size |
   | `CompactionThreshold` | e.g. `0.8` |
   | `KeepRecentTokens` | preserved after compaction |
   | `MaxTokens` | response token limit |
   | `SystemPrompt` | show length + first 80 chars, not the full text |
   | `Debug` | on/off |
   | `UILogPath` | log file location |
   | `EnableMultiline` | on/off |
   | `EnableAutocomplete` | on/off |
   | `AutocompleteMaxItems` | max suggestions |
   | `TabCompletes` | on/off |

3. **Update help text (line ~841)**

   Add:

   ```
   helpText += "  /config        - Show effective configuration\n"
   ```

### `internal/app/config_types.go`

4. **Add `Source` field to `Config`**

   Track where each value came from (default, yaml, env, cli) so `/config show` can display it. Add a `source` map or per-field tags:

   ```go
   type ConfigSource string

   const (
       SourceDefault ConfigSource = "default"
       SourceYAML     ConfigSource = "yaml"
       SourceEnv      ConfigSource = "env"
       SourceCLI      ConfigSource = "cli"
   )
   ```

   Either embed a `map[string]ConfigSource` in `Config` or add a `GetSource(key string) ConfigSource` method. Keep it minimal — only needed for display.

### `internal/app/config.go`

5. **Populate source tracking in `LoadConfig` / `ApplyFlags`**

   When a value is overridden, record the source. The loader already has the precedence chain (yaml → env → cli), so this is a straightforward addition alongside the existing assignment logic.

## Notes

- Bearer token masking: show `...xxxx` or empty string if not set. Never print the full token.
- System prompt: only show length and a truncated preview to avoid flooding the message view.
- Source display is additive — a `[env]` or `[cli]` tag next to the value makes it easy to understand why a non-default value is active.
- Keep it read-only. No config mutation from `/config` for now (that's a future stretch).
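
As an aside that is not part of the generated plan, the masking and preview rules from its Notes could be sketched as below. The helper names `maskToken` and `promptPreview` are hypothetical, and fully masking tokens of four characters or fewer is my own assumption (the plan only says to show the last 4 characters or `***`).

```go
package main

import "fmt"

// maskToken returns "" for an unset token, "***" for very short tokens
// (an assumption, so a 4-character secret is never printed whole),
// and "..." followed by the last 4 characters otherwise.
func maskToken(token string) string {
	if token == "" {
		return ""
	}
	r := []rune(token)
	if len(r) <= 4 {
		return "***"
	}
	return "..." + string(r[len(r)-4:])
}

// promptPreview reports the prompt length and at most max leading characters.
// It works on runes so multi-byte characters are never cut in half.
func promptPreview(prompt string, max int) string {
	r := []rune(prompt)
	if len(r) <= max {
		return fmt.Sprintf("%d chars: %s", len(r), prompt)
	}
	return fmt.Sprintf("%d chars: %s...", len(r), string(r[:max]))
}

func main() {
	fmt.Println(maskToken("sk-example-1234"))
	fmt.Println(promptPreview("You are a helpful coding assistant.", 80))
}
```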

The first observation I can make is that Step 3.7’s prose is noticeably different from Qwen’s. It is not better or worse, just different, and I like it.

I then used llama.cpp on a Strix Halo to implement this plan, one model at a time.

Results and interpretation

| Criteria | Step 3.7 | Qwen 3.5 122B-A10B | Qwen 3.6 27B | Qwen 3.6 35B-A3B |
|---|---|---|---|---|
| Quant | IQ4_XS | UD-Q4_K_XL | Q5_K_M | UD-Q6_K |
| Token counts | 69555 | 67814 | 68729 | 129280 |
| Duration (wall time) | 24m36s | 11m51s | 15m04s | 15m19s |
| Fully implemented? | Yes | Yes | Yes | Yes |
| MTP supported in llama.cpp? | No | Yes | Yes | Yes |
| Extra observation | N/A | Slightly better than the others on adapting unit tests (simpler) | Has added /config to the commands autocompletion list | N/A |

It is interesting to see that they all produced a working implementation with extremely similar results, which means that comparing their wall time and token counts makes sense for this use case.
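
One rough way to combine the two numbers is tokens per wall-clock second. The sketch below uses the figures from the table; the `run` type and `tokensPerSecond` helper are hypothetical names. I am assuming the token counts are totals for the whole session, so this is a coarse end-to-end rate and not a decoding speed.

```go
package main

import (
	"fmt"
	"time"
)

// run holds one row of the results table.
type run struct {
	Model   string
	Tokens  int
	Elapsed string // wall time in Go duration syntax, e.g. "24m36s"
}

// tokensPerSecond divides the total token count by the wall time.
func tokensPerSecond(tokens int, elapsed string) (float64, error) {
	d, err := time.ParseDuration(elapsed)
	if err != nil {
		return 0, err
	}
	if d <= 0 {
		return 0, fmt.Errorf("non-positive duration %q", elapsed)
	}
	return float64(tokens) / d.Seconds(), nil
}

func main() {
	runs := []run{
		{"Step 3.7", 69555, "24m36s"},
		{"Qwen 3.5 122B-A10B", 67814, "11m51s"},
		{"Qwen 3.6 27B", 68729, "15m04s"},
		{"Qwen 3.6 35B-A3B", 129280, "15m19s"},
	}
	for _, r := range runs {
		tps, err := tokensPerSecond(r.Tokens, r.Elapsed)
		if err != nil {
			fmt.Println("skipping", r.Model, ":", err)
			continue
		}
		fmt.Printf("%-20s %6.1f tokens/s\n", r.Model, tps)
	}
}
```

With the table's numbers this gives roughly 47, 95, 76 and 141 tokens per second respectively. Qwen 3.6 35B-A3B has the highest rate but also needed almost twice as many tokens as the others to get to the same result, which is why wall time remains the more useful number here.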

For this exact use case, I consider Qwen 3.5 122B-A10B the best for me:

  1. its wall time is the shortest,
  2. adding /config to the autocompletion list was not part of the plan: even though it may sound like Qwen 3.6 27B did better by doing this, it is a deviation from the plan!

This is my verdict because I supervise the models' results and I expect them to follow the plan fully when there is one. For someone purely vibe-coding everything and/or not looking at results in depth, Qwen 3.6 27B could be a better choice since it went the extra mile.

Some variance here and there could be a fluke, or could be a trait of the model or quant. In order to have something more reproducible in the future, I’ll develop a testbench with a set of E2E tests and a list of criteria that will be interpreted by another [big] LLM. Stay tuned!

