# Endpoint Routing — How the adapter picks `/v1/chat/completions` vs `/v1/responses`

> **TL;DR** The decision is made by a single regex against the model id string.
> No capability discovery, no flag, no per-request override.

## The decision

The full routing logic lives in **one method**:

`lib/dispatch/adapter/copilot.rb`

```ruby
# Returns true when the selected model requires the /v1/responses endpoint.
# This applies to GPT-5 reasoning models. These models reject tool calls on
# /v1/chat/completions and return a 400 RequestError directing callers to
# use /v1/responses instead.
def uses_responses_api?
  @model.match?(/\Agpt-5/)
end
```

`\A` anchors at the start of the string, so any model id whose name begins
with the literal `gpt-5` (case-sensitive) is routed to the Responses API.
Everything else goes to Chat Completions.
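
To make the anchoring and case-sensitivity concrete, here is a minimal sketch that applies the same pattern to a few ids. `endpoint_for` is a hypothetical helper written for this example, not a method in the adapter, and the last three ids are invented to show edge cases.

```ruby
# Hypothetical helper mirroring uses_responses_api?; not part of the adapter.
def endpoint_for(model_id)
  model_id.match?(/\Agpt-5/) ? "/v1/responses" : "/v1/chat/completions"
end

# The first two ids come from the model list; the last three are made up.
%w[gpt-5-mini gpt-4.1 GPT-5 openai/gpt-5 gpt-50].each do |id|
  puts "#{id.ljust(14)} #{endpoint_for(id)}"
end
```

Note that the pattern has no trailing boundary, so an id such as `gpt-50` would also be routed to the Responses API, while an uppercase or vendor-prefixed id would not.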

The check is invoked once per `#chat` call:

```ruby
# lib/dispatch/adapter/copilot.rb (inside #chat)
if uses_responses_api?
  if stream
    chat_streaming_responses(...)        # POST /v1/responses (SSE)
  else
    chat_non_streaming_responses(...)    # POST /v1/responses
  end
else
  # build chat-completions body
  if stream
    chat_streaming(...)                  # POST /v1/chat/completions (SSE)
  else
    chat_non_streaming(...)              # POST /v1/chat/completions
  end
end
```

The four code paths are:

| Path | Method | Endpoint | Streamed? |
|---|---|---|---|
| Responses, streaming | `chat_streaming_responses` | `POST /v1/responses` | yes |
| Responses, blocking | `chat_non_streaming_responses` | `POST /v1/responses` | no |
| Chat, streaming | `chat_streaming` | `POST /v1/chat/completions` | yes |
| Chat, blocking | `chat_non_streaming` | `POST /v1/chat/completions` | no |

All four live in `lib/dispatch/adapter/copilot.rb`.

## Body-shape differences (what the adapter rewrites silently)

| Concept | `/v1/chat/completions` body | `/v1/responses` body |
|---|---|---|
| Conversation | `messages: [...]` | `input: [...]` |
| Token cap | `max_tokens` (or `max_completion_tokens` on o*/gpt-5/gemini) | `max_output_tokens` |
| Reasoning effort | `reasoning_effort: "high"` | `reasoning: { effort: "high" }` |
| Tool definition | `{ type: "function", function: { name, description, parameters } }` | `{ type: "function", name, description, parameters }` (no `function:` wrapper) |

These transforms are handled inside the adapter — callers always pass the
same `Dispatch::Adapter::ToolDefinition` / `Dispatch::Adapter::Message`
structs and the same `thinking:` keyword.
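
The table can be read as two small body builders. The sketch below is an illustration only: the helper names are hypothetical, a plain hash stands in for `Dispatch::Adapter::ToolDefinition`, messages are passed through unchanged (the real adapter may reshape them), and the `max_completion_tokens` variant is not modeled.

```ruby
require "json"

# Hypothetical helpers; the adapter's real builder names are not shown in this file.
def chat_completions_tool(tool)
  { type: "function",
    function: { name: tool[:name], description: tool[:description], parameters: tool[:parameters] } }
end

# Same fields, but flattened: no function: wrapper.
def responses_tool(tool)
  { type: "function", name: tool[:name], description: tool[:description], parameters: tool[:parameters] }
end

def chat_completions_body(model:, messages:, tools:, max_tokens:, effort:)
  { model: model, messages: messages, max_tokens: max_tokens, reasoning_effort: effort,
    tools: tools.map { |t| chat_completions_tool(t) } }
end

def responses_body(model:, messages:, tools:, max_tokens:, effort:)
  { model: model, input: messages, max_output_tokens: max_tokens, reasoning: { effort: effort },
    tools: tools.map { |t| responses_tool(t) } }
end

# Example inputs, invented for this sketch.
tool = { name: "get_weather", description: "Look up the weather",
         parameters: { type: "object", properties: {} } }
messages = [{ role: "user", content: "Weather in Oslo?" }]
args = { messages: messages, tools: [tool], max_tokens: 256, effort: "high" }

puts JSON.pretty_generate(chat_completions_body(model: "gpt-4.1", **args))
puts JSON.pretty_generate(responses_body(model: "gpt-5.4", **args))
```

Printing both bodies side by side shows the four renamed or restructured fields from the table while the caller-side inputs stay identical.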

## Current model list and routing

Source: `reference/models.txt` (lives one level up from this gem, in the
parent `update-adapters/` workspace; format is `model_id,premium_multiplier`).

| Model id | Premium multiplier | `\Agpt-5` match? | Endpoint |
|---|---|---|---|
| gpt-4.1 | 0.0 | ❌ | `/v1/chat/completions` |
| gpt-4o | 0.0 | ❌ | `/v1/chat/completions` |
| gpt-5-mini | 0.0 | ✅ | `/v1/responses` |
| oswe-vscode-prime | 0.0 | ❌ | `/v1/chat/completions` |
| grok-code-fast-1 | 0.25 | ❌ | `/v1/chat/completions` |
| claude-haiku-4.5 | 0.33 | ❌ | `/v1/chat/completions` |
| gemini-3-flash-preview | 0.33 | ❌ | `/v1/chat/completions` |
| gpt-5.4-mini | 0.33 | ✅ | `/v1/responses` |
| claude-sonnet-4 | 1.0 | ❌ | `/v1/chat/completions` |
| claude-sonnet-4.5 | 1.0 | ❌ | `/v1/chat/completions` |
| claude-sonnet-4.6 | 1.0 | ❌ | `/v1/chat/completions` |
| gemini-2.5-pro | 1.0 | ❌ | `/v1/chat/completions` |
| gemini-3.1-pro-preview | 1.0 | ❌ | `/v1/chat/completions` |
| gpt-5.2 | 1.0 | ✅ | `/v1/responses` |
| gpt-5.2-codex | 1.0 | ✅ | `/v1/responses` |
| gpt-5.3-codex | 1.0 | ✅ | `/v1/responses` |
| gpt-5.4 | 1.0 | ✅ | `/v1/responses` |
| claude-opus-4.7 | 7.5 | ❌ | `/v1/chat/completions` |
| gpt-5.5 | 7.5 | ✅ | `/v1/responses` |
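
The endpoint column can be regenerated from the `model_id,premium_multiplier` format rather than maintained by hand. The following is a rough sketch under a few assumptions: `routing_rows` is a hypothetical helper, the input is inlined instead of read from disk, and the file is assumed to have no header row or comment lines.

```ruby
# Sample rows in the model_id,premium_multiplier format; a subset of the table above.
sample = <<~CSV
  gpt-4.1,0.0
  gpt-5-mini,0.0
  claude-haiku-4.5,0.33
  gpt-5.2-codex,1.0
  claude-opus-4.7,7.5
CSV

# Hypothetical helper: parses each row and applies the same prefix check
# as uses_responses_api?. Float() raises on a malformed multiplier.
def routing_rows(text)
  text.each_line(chomp: true).reject(&:empty?).map do |line|
    model_id, multiplier = line.split(",", 2)
    endpoint = model_id.match?(/\Agpt-5/) ? "/v1/responses" : "/v1/chat/completions"
    { model_id: model_id, multiplier: Float(multiplier), endpoint: endpoint }
  end
end

routing_rows(sample).each do |row|
  puts "| #{row[:model_id]} | #{row[:multiplier]} | `#{row[:endpoint]}` |"
end
```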

## Why a regex and not capability discovery?

`GET https://api.githubcopilot.com/models` does NOT return a field that
indicates which endpoint a given model accepts. A typical entry looks like:

```json
{
  "id": "claude-3.7-sonnet",
  "vendor": "Anthropic",
  "model_picker_enabled": true,
  "policy": { "state": "enabled" },
  "capabilities": {
    "family": "claude-3.7-sonnet",
    "type": "chat",
    "tokenizer": "o200k_base",
    "limits": { "max_context_window_tokens": 200000, "max_output_tokens": 8192, "max_prompt_tokens": 90000 },
    "supports": { "streaming": true, "tool_calls": true, "parallel_tool_calls": true, "vision": true }
  }
}
```

There is no `endpoints`, `api`, `responses_api`, or `chat_completions`
flag. The only signal that a model needs `/v1/responses` is the **400 error
string** Copilot returns when you send tools plus `reasoning_effort` to
`/v1/chat/completions` for a GPT-5 family model:

```
Function tools with reasoning_effort are not supported for gpt-5.4 in
/v1/chat/completions. Please use /v1/responses instead.
```

Hence the hardcoded `/\Agpt-5/` heuristic. See
`GPT5_RESPONSES_API.md` for the original problem statement.

## How to update this when GitHub adds new models

When GitHub Copilot adds a new model that requires `/v1/responses`:

1. **Edit the regex** in
   `lib/dispatch/adapter/copilot.rb` at the `uses_responses_api?` method.
   Turn the pattern into an alternation that includes the new family. The
   extra family names below are placeholders, not real model ids:

   ```ruby
   def uses_responses_api?
     @model.match?(/\A(?:gpt-5|gpt-6|codex-6|o5)/)
   end
   ```

2. **Update the test expectations** in
   `spec/dispatch/adapter/copilot_spec.rb`. Search for `uses_responses_api`
   and `/\Agpt-5/` to find the relevant examples; both positive (a model
   that should match) and negative (a model that shouldn't) cases need
   updating.

3. **Update the model table** under "Current model list and routing" in this file
   (`ENDPOINT_ROUTING.md`) so the documented routing matches the code.

4. **Update `reference/models.txt`** in the parent workspace if you also
   want the new model listed for build/test scripts.

5. **Bump the gem version** in
   `lib/dispatch/adapter/version.rb` (minor bump for new model support,
   patch for a regex tweak that just fixes routing for an existing
   misclassified model).

6. **Run the test gate** from inside this gem:
   ```bash
   bundle exec rubocop --autocorrect-all
   bundle exec rspec
   ```
   Both must exit 0.
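
Steps 1 and 2 can be rehearsed outside the gem before editing the spec file. The sketch below assumes a hypothetical refactor in which the prefixes live in a list and the pattern is built with `Regexp.union`; `responses_api_pattern` and the `gpt-6` family are invented for illustration and do not exist in the adapter.

```ruby
# Hypothetical refactor: build the anchored pattern from a list of prefixes.
# Regexp.union escapes each string, so "gpt-5" is matched literally.
def responses_api_pattern(prefixes)
  /\A#{Regexp.union(prefixes)}/
end

pattern = responses_api_pattern(%w[gpt-5 gpt-6])

# Positive and negative cases, mirroring what the spec should cover.
expectations = {
  "gpt-5.4" => true,            # existing family still matches
  "gpt-6-mini" => true,         # made-up id for the new family
  "gpt-4.1" => false,
  "claude-sonnet-4.6" => false
}

expectations.each do |model_id, expected|
  actual = model_id.match?(pattern)
  raise "#{model_id}: expected #{expected}, got #{actual}" unless actual == expected
end
puts "routing expectations hold"
```

Keeping at least one negative case per vendor guards against a new alternative that accidentally matches too broadly.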

## Alternative: probe-and-fallback (not currently implemented)

A more durable design would catch the specific 400 error string from
`/v1/chat/completions`, cache the offending model id, and retry the request on
`/v1/responses`. Pros: zero hardcoded list. Cons: it adds latency on the
first request per new model per process, and it depends on the upstream
error wording staying stable. The probe must include a tool definition
to be reliable: a tool-less request to `/v1/chat/completions` succeeds for
some GPT-5 variants, and only the tools plus `reasoning_effort` combination
triggers the rejection.
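
A minimal sketch of that idea follows, with the HTTP paths replaced by lambdas so it runs on its own. Everything here is hypothetical: `chat_with_fallback`, `responses_hint?`, and the stubbed callables are names made up for this example, and the error is matched on the message text quoted earlier in this file rather than on any real error class.

```ruby
require "set"

# Matches the upstream wording; brittle by design, as noted above.
def responses_hint?(error)
  error.message.include?("Please use /v1/responses instead")
end

# chat_call and responses_call stand in for the adapter's two HTTP paths;
# known_responses_models is the per-process cache of model ids.
def chat_with_fallback(model, chat_call:, responses_call:, known_responses_models:)
  return responses_call.call(model) if known_responses_models.include?(model)

  begin
    chat_call.call(model)
  rescue StandardError => e
    raise unless responses_hint?(e) # unrelated failures propagate unchanged

    known_responses_models << model
    responses_call.call(model)
  end
end

# Stubbed transports: the fake chat endpoint rejects gpt-5 ids the way Copilot does.
cache = Set.new
chat_calls = 0
chat = lambda do |model|
  chat_calls += 1
  if model.start_with?("gpt-5")
    raise "Function tools with reasoning_effort are not supported for #{model} " \
          "in /v1/chat/completions. Please use /v1/responses instead."
  end
  "chat:#{model}"
end
responses = ->(model) { "responses:#{model}" }

2.times do
  puts chat_with_fallback("gpt-5.4", chat_call: chat, responses_call: responses,
                                     known_responses_models: cache)
end
puts chat_with_fallback("gpt-4.1", chat_call: chat, responses_call: responses,
                                   known_responses_models: cache)
puts "chat endpoint hit #{chat_calls} times"
```

The second `gpt-5.4` call skips the probe because the id is already cached, which is where the "first request per new model per process" cost comes from.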

## File reference (everything routing-related)

| Path | What it contains |
|---|---|
| `lib/dispatch/adapter/copilot.rb` | `uses_responses_api?` (the regex), the `chat` dispatcher, all four code paths, body builders for both endpoints |
| `lib/dispatch/adapter/version.rb` | Gem version constant |
| `spec/dispatch/adapter/copilot_spec.rb` | Tests for both endpoint paths and the routing predicate |
| `GPT5_RESPONSES_API.md` | Original problem statement — the 400 error from Copilot |
| `ENDPOINT_ROUTING.md` | This file |
| `../reference/models.txt` | Workspace-level list of model ids and premium multipliers |