# Vision (Multimodal Image Input)

Vision-capable models accept images alongside text prompts for description, classification, and visual Q&A.

## How to Send Images

Add an `images` array to a message in `/api/chat` or to the top-level request in `/api/generate`.

**REST API:** Images must be **base64-encoded** strings (no file paths or URLs).

### `/api/chat` example

```json
{
  "model": "gemma3",
  "messages": [{
    "role": "user",
    "content": "What is in this image?",
    "images": ["<base64-encoded image data>"]
  }],
  "stream": false
}
```

### `/api/generate` example

```json
{
  "model": "gemma3",
  "prompt": "Describe this image.",
  "images": ["<base64-encoded image data>"],
  "stream": false
}
```
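The request bodies above can be assembled and sent from TypeScript with `fetch`. This is a minimal sketch assuming a local Ollama server at the default `http://localhost:11434`; the helper names `buildGenerateRequest` and `describeImage` are illustrative, not part of any API.

```typescript
// Shape of a non-streaming /api/generate request with image input.
interface GenerateRequest {
  model: string;
  prompt: string;
  images: string[]; // base64-encoded image data
  stream: boolean;
}

// Pure helper: build the request body shown in the example above.
function buildGenerateRequest(
  model: string,
  prompt: string,
  images: string[]
): GenerateRequest {
  return { model, prompt, images, stream: false };
}

// Send the request; with stream: false, the reply text is in `response`.
async function describeImage(base64Image: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(
      buildGenerateRequest("gemma3", "Describe this image.", [base64Image])
    ),
  });
  const data = await res.json();
  return data.response;
}
```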

## Base64 Encoding

In an Obsidian plugin context (TypeScript), convert an `ArrayBuffer` to base64:

```typescript
// Read the image from the vault as raw bytes.
const buffer = await vault.readBinary(file);
const bytes = new Uint8Array(buffer);
// Build the binary string one byte at a time; spreading the whole array
// into String.fromCharCode(...bytes) can exceed the argument-length limit
// on large images.
let binary = '';
for (const b of bytes) binary += String.fromCharCode(b);
const base64 = btoa(binary);
```

## Multiple Images

Pass multiple base64 strings in the `images` array. The model will consider all of them in context.

```json
"images": ["<base64_image_1>", "<base64_image_2>"]
```
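In TypeScript, a chat message carrying several images can be built like this. A sketch assuming the message shape shown in the `/api/chat` example (`role`, `content`, `images` fields); `userMessageWithImages` is an illustrative helper, not an Ollama API.

```typescript
// Shape of one /api/chat message, with optional image attachments.
interface ChatMessage {
  role: "user" | "assistant" | "system";
  content: string;
  images?: string[]; // base64-encoded image data
}

// Attach any number of base64 images to a single user message.
function userMessageWithImages(
  content: string,
  images: string[]
): ChatMessage {
  return { role: "user", content, images };
}
```

The resulting object is placed in the `messages` array of the request body exactly as in the single-image example.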

## Combining with Structured Output

Vision works with the `format` parameter — use a JSON schema to get structured descriptions:

```json
{
  "model": "gemma3",
  "messages": [{
    "role": "user",
    "content": "Describe the objects in this photo.",
    "images": ["<base64>"]
  }],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "objects": {
        "type": "array",
        "items": {"type": "string"}
      }
    },
    "required": ["objects"]
  }
}
```
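When `format` is set, the model's reply arrives as a JSON string conforming to the schema, which the client must parse. A sketch assuming the reply text is in `message.content` of the chat response and matches the `objects` schema above; `parseObjects` is an illustrative helper.

```typescript
// Expected shape, per the JSON schema in the request above.
interface ObjectsResult {
  objects: string[];
}

// Parse and validate the structured reply; throws if the model's output
// does not match the requested schema.
function parseObjects(content: string): ObjectsResult {
  const parsed = JSON.parse(content) as ObjectsResult;
  if (!Array.isArray(parsed.objects)) {
    throw new Error("response did not match the expected schema");
  }
  return parsed;
}
```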

## Supported Models

Any model with vision/multimodal capability, e.g.:
- `gemma3`
- `llava`
- Browse: [vision models](https://ollama.com/search?c=vision)