whisper.oxc.dev API

Speech-to-text API on Cloudflare Workers AI — Deepgram real-time + Whisper finalization + LLM summarization

Authentication

All endpoints require an API token. Pass it via:

HTTP — Authorization header

Authorization: Bearer YOUR_API_TOKEN

WebSocket / HTTP — Query parameter

wss://whisper.oxc.dev/session/:id/ws?token=YOUR_API_TOKEN
https://whisper.oxc.dev/api/transcribe?token=YOUR_API_TOKEN

Set the token as an environment variable for CLI usage:
export WHISPER_API_TOKEN="your-uuid-here"

Then use it in curl:
curl -H "Authorization: Bearer $WHISPER_API_TOKEN" ...
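
The two ways of passing the token can be sketched in Swift as follows. The helper names `authHeaders` and `authenticatedURL` are hypothetical and not part of the API; `SESSION_ID` is a placeholder. Using `URLComponents` is one reasonable way to append the `token` parameter so that any special characters are percent-encoded.

```swift
import Foundation

// Hypothetical helpers for illustration; the names are not part of the API.

/// Header form: "Authorization: Bearer <token>".
func authHeaders(token: String) -> [String: String] {
    ["Authorization": "Bearer \(token)"]
}

/// Query form: appends token=<token> to a URL, keeping any existing query items.
func authenticatedURL(_ urlString: String, token: String) -> URL? {
    guard var components = URLComponents(string: urlString) else { return nil }
    var items = components.queryItems ?? []
    items.append(URLQueryItem(name: "token", value: token))
    components.queryItems = items
    return components.url
}

// Read the token from the environment, mirroring the CLI setup above.
let envToken = ProcessInfo.processInfo.environment["WHISPER_API_TOKEN"] ?? "YOUR_API_TOKEN"
let exampleWsURL = authenticatedURL("wss://whisper.oxc.dev/session/SESSION_ID/ws", token: envToken)
```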

Batch Transcription

POST /api/transcribe

Upload an audio file and get back a full transcript with timestamps, segments, VTT, and summary.

Request

Send the audio file as the request body with the appropriate Content-Type, or as multipart form data with field name audio.

curl -X POST https://whisper.oxc.dev/api/transcribe \
  -H "Authorization: Bearer $WHISPER_API_TOKEN" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @recording.mp3
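
For the multipart alternative, the following is a minimal sketch of how a request body with a single `audio` field could be assembled in Swift. The function name `multipartAudioBody` and the boundary format are assumptions made for illustration; only the field name `audio` comes from the description above.

```swift
import Foundation

/// Builds a multipart/form-data body containing one file part named "audio".
/// Returns the body plus the Content-Type header value that matches the boundary.
/// (Hypothetical helper; not part of the API.)
func multipartAudioBody(
    fileData: Data,
    filename: String,
    mimeType: String,
    boundary: String = "Boundary-\(UUID().uuidString)"
) -> (body: Data, contentType: String) {
    var body = Data()
    // Part header: field name "audio", as required by the endpoint.
    body.append(Data("--\(boundary)\r\n".utf8))
    body.append(Data("Content-Disposition: form-data; name=\"audio\"; filename=\"\(filename)\"\r\n".utf8))
    body.append(Data("Content-Type: \(mimeType)\r\n\r\n".utf8))
    // Raw file bytes, followed by the closing boundary.
    body.append(fileData)
    body.append(Data("\r\n--\(boundary)--\r\n".utf8))
    return (body, "multipart/form-data; boundary=\(boundary)")
}
```

The returned `contentType` would go in the request's Content-Type header and `body` in `httpBody`, alongside the Authorization header.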

Query params:

Param	Default	Description
`summary`	`true`	Set to `false` to skip LLM summarization

Response

{
  "text": "Full transcript text...",
  "segments": [
    {
      "i": 0,
      "text": "Hello, welcome to the meeting.",
      "start": 0.0,
      "end": 2.4,
      "speaker": null,
      "words": [
        { "word": "Hello,", "start": 0.0, "end": 0.5 },
        { "word": "welcome", "start": 0.6, "end": 1.0 }
      ]
    }
  ],
  "vtt": "WEBVTT\n\n00:00.000 --> 00:02.400\nHello, welcome...",
  "duration": 45.2,
  "summary": "## Summary\nA meeting discussing..."
}
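
A response of this shape can be decoded with `Codable`. The sketch below mirrors the fields in the sample; the type names are hypothetical, and the optionality of each field is an assumption (for example, `summary` is treated as optional because it can be disabled, and `speaker` is treated as an integer index because the Swift client later in this document reads it as `Int`).

```swift
import Foundation

// Hypothetical model types mirroring the sample response above.
struct Word: Codable {
    let word: String
    let start: Double
    let end: Double
}

struct Segment: Codable {
    let i: Int
    let text: String
    let start: Double?
    let end: Double?
    let speaker: Int?   // assumption: integer speaker index; null in the sample
    let words: [Word]?
}

struct BatchTranscript: Codable {
    let text: String
    let segments: [Segment]
    let vtt: String?
    let duration: Double?
    let summary: String?  // assumption: missing when summary=false
}

/// Decodes the JSON body returned by POST /api/transcribe.
func decodeTranscript(_ data: Data) throws -> BatchTranscript {
    try JSONDecoder().decode(BatchTranscript.self, from: data)
}
```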

Session Management

Create Session

POST /api/sessions

Creates a new transcription session backed by a Durable Object. Returns the session ID and WebSocket URL.

curl -X POST https://whisper.oxc.dev/api/sessions \
  -H "Authorization: Bearer $WHISPER_API_TOKEN"

Response

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "wsUrl": "wss://whisper.oxc.dev/session/550e8400.../ws",
  "status": "recording",
  "segments": [],
  "text": ""
}

Get Session State

GET /api/sessions/:id

Retrieve the current state of a session (useful for reconnection or polling).

{
  "status": "done",
  "segments": [...],
  "text": "Full transcript...",
  "vtt": "WEBVTT\n...",
  "duration": 45.2,
  "summary": "## Summary\n..."
}
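
When polling this endpoint, a client needs to decide when to stop. The sketch below assumes the only status values are the ones that appear in this document ("recording", "processing", "done") and maps anything else to `.unknown`. The type and function names are hypothetical, and whether "done" is reported before or after the summary is ready is not stated here, so treat the stop condition as an assumption.

```swift
import Foundation

/// Status values seen in this document; any other string becomes .unknown.
enum SessionStatus: Equatable {
    case recording, processing, done
    case unknown(String)

    init(_ raw: String) {
        switch raw {
        case "recording": self = .recording
        case "processing": self = .processing
        case "done": self = .done
        default: self = .unknown(raw)
        }
    }
}

/// Subset of the session state; extra keys such as "segments" are ignored by the decoder.
struct SessionState: Decodable {
    let status: String
    let text: String
    let vtt: String?
    let duration: Double?
    let summary: String?

    var parsedStatus: SessionStatus { SessionStatus(status) }
}

/// Assumption: polling can stop once the session reports "done".
func shouldKeepPolling(_ state: SessionState) -> Bool {
    state.parsedStatus != .done
}
```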

WebSocket Protocol

WS /session/:id/ws

Bi-directional WebSocket for real-time streaming transcription.

Client → Server

Message	Format	Description
Audio data	Binary (ArrayBuffer)	Raw PCM int16 mono 16kHz — send in ~5s chunks
`{"type":"stop"}`	JSON string	Stop recording, trigger Whisper finalization + summarization
`{"type":"ping"}`	JSON string	Keep-alive, server responds with `pong`

Server → Client

Type	When	Fields
`state`	On connect, status changes	`status`, `segments[]`, `text`, `vtt?`, `duration?`, `summary?`
`interim`	Real-time (Deepgram, not final)	`text`, `speaker?`
`segment`	Deepgram final segment	`i`, `text`, `speaker?`, `start?`, `end?`
`transcript`	Whisper finalization complete	`text`, `segments[]`, `vtt?`, `duration?`, `status`
`summary`	LLM summary ready (async, after transcript)	`summary`
`error`	Any error	`message`
`pong`	Response to ping	none
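
The server message types in the table above can be mapped onto a Swift enum so that a client handles every case explicitly, including `error` and `pong`. This is a sketch under the assumption that each message is a JSON object with a `type` field, as the table implies; the enum and function names are hypothetical, and only a subset of each message's fields is carried.

```swift
import Foundation

// Hypothetical typed view of the server messages listed above.
enum ServerMessage: Equatable {
    case state(status: String)
    case interim(text: String, speaker: Int?)
    case segment(index: Int, text: String, speaker: Int?, start: Double?, end: Double?)
    case transcript(text: String)
    case summary(String)
    case error(String)
    case pong
    case unknown(type: String)
}

/// Parses one text frame; returns nil if it is not a JSON object with a "type" string.
func parseServerMessage(_ raw: String) -> ServerMessage? {
    guard let object = try? JSONSerialization.jsonObject(with: Data(raw.utf8)),
          let json = object as? [String: Any],
          let type = json["type"] as? String else { return nil }

    let text = json["text"] as? String ?? ""
    switch type {
    case "state": return .state(status: json["status"] as? String ?? "")
    case "interim": return .interim(text: text, speaker: json["speaker"] as? Int)
    case "segment":
        return .segment(index: json["i"] as? Int ?? 0, text: text,
                        speaker: json["speaker"] as? Int,
                        start: json["start"] as? Double, end: json["end"] as? Double)
    case "transcript": return .transcript(text: text)
    case "summary": return .summary(json["summary"] as? String ?? "")
    case "error": return .error(json["message"] as? String ?? "")
    case "pong": return .pong
    default: return .unknown(type: type)  // tolerate message types added later
    }
}
```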

Message Flow

Client                          Server
  |                                |
  |--- POST /api/sessions -------->|  Create session
  |<-- { id, wsUrl } --------------|
  |                                |
  |--- WS connect /session/x/ws -->|  Connect
  |<-- { type: "state" } ----------|  Initial state
  |                                |
  |--- binary PCM chunk 1 -------->|  Audio
  |--- binary PCM chunk 2 -------->|
  |<-- { type: "interim" } --------|  Real-time partial
  |<-- { type: "segment", i:0 } ---|  Committed segment
  |--- binary PCM chunk 3 -------->|
  |<-- { type: "interim" } --------|
  |<-- { type: "segment", i:1 } ---|
  |                                |
  |--- { type: "stop" } ---------->|  Stop recording
  |<-- { type: "state" } ----------|  status: "processing" (finalizing)
  |<-- { type: "transcript" } -----|  Whisper final (immediate)
  |                                |
  |   ... a few seconds later ...  |
  |                                |
  |<-- { type: "summary" } --------|  LLM summary (async)
  |                                |
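
On the client side, the flow above amounts to a small piece of state: interim text is provisional, segments are committed by index, and the Whisper transcript replaces everything once it arrives. The sketch below is one possible way to model that. The type name `LiveTranscript` is hypothetical, and clearing the interim text whenever a segment is committed is an assumption about how the two message types relate.

```swift
/// Hypothetical client-side model of the message flow above.
struct LiveTranscript {
    private(set) var committed: [Int: String] = [:]  // keyed by segment index `i`
    private(set) var interim = ""
    private(set) var finalText: String?
    private(set) var summary: String?

    /// "interim": replace the provisional text.
    mutating func applyInterim(_ text: String) { interim = text }

    /// "segment": commit text at its index and drop the partial it supersedes (assumption).
    mutating func applySegment(index: Int, text: String) {
        committed[index] = text
        interim = ""
    }

    /// "transcript": the Whisper result replaces the real-time text.
    mutating func applyTranscript(_ text: String) {
        finalText = text
        interim = ""
    }

    /// "summary": arrives later and does not change the transcript.
    mutating func applySummary(_ text: String) { summary = text }

    /// Text to display: the final transcript if present, else segments in index order plus interim.
    var displayText: String {
        if let finalText { return finalText }
        var parts = committed.keys.sorted().compactMap { committed[$0] }
        if !interim.isEmpty { parts.append(interim) }
        return parts.joined(separator: " ")
    }
}
```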

Audio Format

WebSocket Streaming

Format: Raw PCM, signed 16-bit integers, little-endian
Sample rate: 16,000 Hz
Channels: 1 (mono)
Chunk size: ~5 seconds recommended (80,000 samples = 160,000 bytes)
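
The chunk-size arithmetic and the byte layout can be checked with a short Swift sketch. The function name `pcm16LittleEndian` is hypothetical, and scaling float samples by `Int16.max` is one common convention, not something this API prescribes.

```swift
import Foundation

let sampleRate = 16_000       // Hz
let bytesPerSample = 2        // signed 16-bit
let chunkSeconds = 5
// 16,000 samples/s * 5 s = 80,000 samples; * 2 bytes = 160,000 bytes
let chunkBytes = sampleRate * chunkSeconds * bytesPerSample

/// Converts float samples in [-1, 1] to little-endian int16 PCM bytes.
/// Out-of-range values are clamped; NaN is written as silence.
func pcm16LittleEndian(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * bytesPerSample)
    for sample in samples {
        let clamped: Float = sample.isNaN ? 0 : max(-1, min(1, sample))
        let value = Int16((clamped * Float(Int16.max)).rounded())
        // Append the two bytes in little-endian order regardless of host byte order.
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data
}
```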

Batch Upload

Accepts any common audio format: WAV, MP3, M4A, OGG, FLAC, WEBM. The server handles decoding.
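
When uploading the raw file as the request body, the Content-Type should match the file. The sketch below maps the extensions listed above to conventional audio MIME types. The function name is hypothetical, and which exact Content-Type values the server accepts is not specified in this document, so the mapping is an assumption.

```swift
import Foundation

/// Maps a file extension to a conventional audio MIME type.
/// Returns nil for extensions not covered by the list above.
func audioContentType(forExtension ext: String) -> String? {
    switch ext.lowercased() {
    case "wav": return "audio/wav"
    case "mp3": return "audio/mpeg"
    case "m4a": return "audio/mp4"
    case "ogg": return "audio/ogg"
    case "flac": return "audio/flac"
    case "webm": return "audio/webm"
    default: return nil
    }
}

// Example: derive the header value from a file path.
let exampleType = audioContentType(forExtension: URL(fileURLWithPath: "recording.MP3").pathExtension)
```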

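For iOS streaming: capture with AVAudioEngine and convert each tap buffer to 16 kHz mono int16 PCM (for example with AVAudioConverter, since the input node typically delivers audio in the hardware format). Then read .int16ChannelData from the converted AVAudioPCMBuffer and send the raw bytes over the WebSocket.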

Swift Client Example

Minimal Swift integration for an iOS app using native URLSessionWebSocketTask and AVAudioEngine:

import AVFoundation
import Foundation

class WhisperClient {
    private let baseURL = "https://whisper.oxc.dev"
    private let apiToken: String
    private var wsTask: URLSessionWebSocketTask?
    private let engine = AVAudioEngine()
    private let session = URLSession.shared

    init(apiToken: String) {
        self.apiToken = apiToken
    }

    var onInterim: ((String, Int?) -> Void)?
    var onSegment: ((Int, String, Int?, Double?) -> Void)?
    var onTranscript: ((TranscriptResult) -> Void)?
    var onSummary: ((String) -> Void)?

    struct SessionInfo: Codable {
        let id: String
        let wsUrl: String
    }

    struct TranscriptResult {
        let text: String
        let segments: [[String: Any]]
        let vtt: String?
        let duration: Double?
    }

    // MARK: - Session lifecycle

    func start() async throws {
        let info = try await createSession()
        connectWebSocket(url: info.wsUrl)
        try startAudioCapture()
    }

    func stop() {
        stopAudioCapture()
        sendJSON(["type": "stop"])
    }

    // MARK: - HTTP

    private func createSession() async throws -> SessionInfo {
        var req = URLRequest(url: URL(string: "\(baseURL)/api/sessions")!)
        req.httpMethod = "POST"
        req.setValue("Bearer \(apiToken)", forHTTPHeaderField: "Authorization")
        let (data, _) = try await session.data(for: req)
        return try JSONDecoder().decode(SessionInfo.self, from: data)
    }

    /// Batch transcribe an audio file
    func transcribe(fileURL: URL) async throws -> Data {
        var req = URLRequest(url: URL(string: "\(baseURL)/api/transcribe")!)
        req.httpMethod = "POST"
        req.setValue("Bearer \(apiToken)", forHTTPHeaderField: "Authorization")
        let data = try Data(contentsOf: fileURL)
        req.httpBody = data
        // This example assumes an MP3 file; set Content-Type to match the actual format.
        req.setValue("audio/mpeg", forHTTPHeaderField: "Content-Type")
        let (responseData, _) = try await session.data(for: req)
        return responseData
    }

    // MARK: - WebSocket

    private func connectWebSocket(url: String) {
        let separator = url.contains("?") ? "&" : "?"
        wsTask = session.webSocketTask(with: URL(string: "\(url)\(separator)token=\(apiToken)")!)
        wsTask?.resume()
        receiveMessage()
    }

    private func receiveMessage() {
        wsTask?.receive { [weak self] result in
            guard let self else { return }
            if case .success(let msg) = result {
                if case .string(let text) = msg {
                    self.handleMessage(text)
                }
                self.receiveMessage()
            }
        }
    }

    private func handleMessage(_ text: String) {
        guard let data = text.data(using: .utf8),
              let json = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
              let type = json["type"] as? String else { return }

        switch type {
        case "interim":
            onInterim?(json["text"] as? String ?? "", json["speaker"] as? Int)
        case "segment":
            onSegment?(
                json["i"] as? Int ?? 0,
                json["text"] as? String ?? "",
                json["speaker"] as? Int,
                json["start"] as? Double
            )
        case "transcript":
            onTranscript?(TranscriptResult(
                text: json["text"] as? String ?? "",
                segments: json["segments"] as? [[String: Any]] ?? [],
                vtt: json["vtt"] as? String,
                duration: json["duration"] as? Double
            ))
        case "summary":
            onSummary?(json["summary"] as? String ?? "")
        default:
            break
        }
    }

    private func sendJSON(_ dict: [String: String]) {
        guard let data = try? JSONSerialization.data(withJSONObject: dict),
              let str = String(data: data, encoding: .utf8) else { return }
        wsTask?.send(.string(str)) { _ in }
    }

    private func sendAudio(_ buffer: Data) {
        wsTask?.send(.data(buffer)) { _ in }
    }

    // MARK: - Audio capture

    // Converted PCM waiting to be sent; filled by the tap callback, flushed on stop.
    private var pcmAccumulator = Data()

    private func startAudioCapture() throws {
        let inputNode = engine.inputNode
        let nativeFmt = inputNode.outputFormat(forBus: 0)

        let targetFmt = AVAudioFormat(
            commonFormat: .pcmFormatInt16,
            sampleRate: 16000,
            channels: 1,
            interleaved: true
        )!

        guard let converter = AVAudioConverter(from: nativeFmt, to: targetFmt) else {
            throw NSError(domain: "WhisperClient", code: 1,
                          userInfo: [NSLocalizedDescriptionKey: "Cannot create audio converter"])
        }

        pcmAccumulator.removeAll()
        let chunkBytes = 16000 * 2 * 5 // 5 seconds of int16 mono

        inputNode.installTap(onBus: 0, bufferSize: 4096, format: nativeFmt) { [weak self] buffer, _ in
            guard let self else { return }

            // Round up so the resampled frames always fit in the output buffer.
            let ratio = 16000.0 / nativeFmt.sampleRate
            let capacity = AVAudioFrameCount((Double(buffer.frameLength) * ratio).rounded(.up)) + 1
            guard let outBuf = AVAudioPCMBuffer(pcmFormat: targetFmt, frameCapacity: capacity) else { return }

            // Hand the tap buffer to the converter exactly once; the converter may call
            // the input block again, and returning the same buffer twice duplicates audio.
            var consumed = false
            var error: NSError?
            let status = converter.convert(to: outBuf, error: &error) { _, outStatus in
                if consumed {
                    outStatus.pointee = .noDataNow
                    return nil
                }
                consumed = true
                outStatus.pointee = .haveData
                return buffer
            }
            guard status != .error, error == nil else { return }

            if let channelData = outBuf.int16ChannelData {
                let ptr = UnsafeBufferPointer(start: channelData[0], count: Int(outBuf.frameLength))
                self.pcmAccumulator.append(ptr)
            }

            // Send full 5-second chunks; keep the remainder for the next callback.
            while self.pcmAccumulator.count >= chunkBytes {
                self.sendAudio(Data(self.pcmAccumulator.prefix(chunkBytes)))
                self.pcmAccumulator = Data(self.pcmAccumulator.dropFirst(chunkBytes))
            }
        }

        try engine.start()
    }

    private func stopAudioCapture() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
        // Flush the final partial chunk so the last few seconds of audio are not dropped.
        if !pcmAccumulator.isEmpty {
            sendAudio(pcmAccumulator)
            pcmAccumulator.removeAll()
        }
    }
}
}

Usage

let client = WhisperClient(apiToken: "your-api-token-here")

client.onInterim = { text, speaker in
    print("[\(speaker.map { "Speaker \($0)" } ?? "")]", text)
}

client.onSegment = { i, text, speaker, start in
    print("Segment \(i): \(text)")
}

client.onTranscript = { result in
    print("Final: \(result.text)")
}

client.onSummary = { summary in
    print("Summary: \(summary)")
}

// Real-time streaming
try await client.start()
// ... user records ...
client.stop()

// Or batch transcribe a file
let data = try await client.transcribe(fileURL: audioFileURL)

iOS recommendation: Use native Swift rather than React Native. You get direct access to AVAudioEngine for low-latency capture, background audio modes (UIBackgroundModes: audio), Siri Shortcuts integration, and an on-device fallback via Apple's Speech framework. The WebSocket client (URLSessionWebSocketTask) is built into Foundation, so no third-party dependency is needed.