Speech-to-text API on Cloudflare Workers AI — Deepgram real-time + Whisper finalization + LLM summarization
All endpoints require an API token. Pass it in the Authorization header:
Authorization: Bearer YOUR_API_TOKEN
Or pass it as a token query parameter:
wss://whisper.oxc.dev/session/:id/ws?token=YOUR_API_TOKEN
https://whisper.oxc.dev/api/transcribe?token=YOUR_API_TOKEN
For example, with the token stored in an environment variable:
export WHISPER_API_TOKEN="your-uuid-here"
curl -H "Authorization: Bearer $WHISPER_API_TOKEN" ...
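The two ways of passing the token can be sketched in Swift as follows. This is a minimal illustration, not part of the service's SDK: the helper names `authorizedRequest` and `urlWithToken` are hypothetical, and the only facts taken from the text above are the `Authorization: Bearer` header and the `token` query parameter.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives in this module on Linux
#endif

/// Hypothetical helper: builds a request carrying the token in the Authorization header.
func authorizedRequest(url: URL, token: String, method: String = "GET") -> URLRequest {
    var request = URLRequest(url: url)
    request.httpMethod = method
    request.setValue("Bearer \(token)", forHTTPHeaderField: "Authorization")
    return request
}

/// Hypothetical helper: appends `token` as a query parameter, keeping any existing query items.
func urlWithToken(_ base: String, token: String) -> URL? {
    guard var components = URLComponents(string: base) else { return nil }
    var items = components.queryItems ?? []
    items.append(URLQueryItem(name: "token", value: token))
    components.queryItems = items
    return components.url
}
```

Building the query string through `URLComponents` rather than string concatenation means the token is percent-encoded if it ever contains reserved characters.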
/api/transcribe
Upload an audio file and get back a full transcript with timestamps, segments, VTT, and summary.
Send the audio file as the request body with the appropriate Content-Type, or as multipart form data with field name audio.
curl -X POST https://whisper.oxc.dev/api/transcribe \
  -H "Authorization: Bearer $WHISPER_API_TOKEN" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @recording.mp3
Query params:
| Param | Default | Description |
|---|---|---|
| summary | true | Set to false to skip LLM summarization |
{
  "text": "Full transcript text...",
  "segments": [
    {
      "i": 0,
      "text": "Hello, welcome to the meeting.",
      "start": 0.0,
      "end": 2.4,
      "speaker": null,
      "words": [
        { "word": "Hello,", "start": 0.0, "end": 0.5 },
        { "word": "welcome", "start": 0.6, "end": 1.0 }
      ]
    }
  ],
  "vtt": "WEBVTT\n\n00:00.000 --> 00:02.400\nHello, welcome...",
  "duration": 45.2,
  "summary": "## Summary\nA meeting discussing..."
}
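A response of this shape can be decoded with `Codable`. The sketch below is one possible mapping, not an official schema: the type names are hypothetical, `speaker` is typed as `Int?` because the Swift client on this page casts it to `Int`, and treating `vtt`, `duration`, and `summary` as optional is an assumption (for example, `summary` is presumably absent when `summary=false`).

```swift
import Foundation

// Hypothetical model types mirroring the sample response above.
struct Word: Codable {
    let word: String
    let start: Double
    let end: Double
}

struct Segment: Codable {
    let i: Int
    let text: String
    let start: Double?
    let end: Double?
    let speaker: Int?   // null in the sample response
    let words: [Word]?
}

struct TranscribeResponse: Codable {
    let text: String
    let segments: [Segment]
    let vtt: String?
    let duration: Double?
    let summary: String?  // assumed to be missing when summarization is skipped
}

/// Decodes the JSON body returned by the batch endpoint.
func decodeTranscribeResponse(_ data: Data) throws -> TranscribeResponse {
    try JSONDecoder().decode(TranscribeResponse.self, from: data)
}
```

Decoding into typed structs surfaces schema mismatches as thrown errors instead of silently defaulting fields, which is the trade-off against the `[String: Any]` casts used in the streaming client.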
/api/sessions
Creates a new transcription session backed by a Durable Object. Returns the session ID and WebSocket URL.
curl -X POST https://whisper.oxc.dev/api/sessions \
  -H "Authorization: Bearer $WHISPER_API_TOKEN"
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "wsUrl": "wss://whisper.oxc.dev/session/550e8400.../ws",
  "status": "recording",
  "segments": [],
  "text": ""
}
/api/sessions/:id
Retrieve the current state of a session (useful for reconnection or polling).
{
  "status": "done",
  "segments": [...],
  "text": "Full transcript...",
  "vtt": "WEBVTT\n...",
  "duration": 45.2,
  "summary": "## Summary\n..."
}
/session/:id/ws
Bi-directional WebSocket for real-time streaming transcription.
| Message | Format | Description |
|---|---|---|
| Audio data | Binary (ArrayBuffer) | Raw PCM int16 mono 16kHz — send in ~5s chunks |
| {"type":"stop"} | JSON string | Stop recording, trigger Whisper finalization + summarization |
| {"type":"ping"} | JSON string | Keep-alive, server responds with pong |
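The chunk size implied by the table works out to 16,000 samples/s × 2 bytes × 5 s = 160,000 bytes. The sketch below shows that arithmetic together with one way to cut buffered PCM into full chunks and to encode the JSON control messages. The function names `splitIntoChunks` and `controlMessage` are hypothetical and are not part of the API.

```swift
import Foundation

let sampleRate = 16_000      // Hz, from the table above
let bytesPerSample = 2       // int16
let chunkSeconds = 5
let chunkBytes = sampleRate * bytesPerSample * chunkSeconds  // 160_000 bytes

/// Splits buffered PCM into full-size chunks and returns whatever is left over.
func splitIntoChunks(_ pcm: Data, chunkSize: Int = chunkBytes) -> (chunks: [Data], remainder: Data) {
    var chunks: [Data] = []
    var offset = pcm.startIndex
    while pcm.endIndex - offset >= chunkSize {
        chunks.append(Data(pcm[offset..<(offset + chunkSize)]))
        offset += chunkSize
    }
    return (chunks, Data(pcm[offset..<pcm.endIndex]))
}

/// Encodes a control message such as {"type":"stop"} as a JSON string.
func controlMessage(_ type: String) -> String {
    let data = (try? JSONEncoder().encode(["type": type])) ?? Data()
    return String(decoding: data, as: UTF8.self)
}
```

Returning the remainder explicitly makes it easy for a caller to send the last partial chunk before the stop message, if the server accepts short chunks (the table only says "~5s", so that is an assumption).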
| Type | When | Fields |
|---|---|---|
| state | On connect, status changes | status, segments[], text, vtt?, duration?, summary? |
| interim | Real-time (Deepgram, not final) | text, speaker? |
| segment | Deepgram final segment | i, text, speaker?, start?, end? |
| transcript | Whisper finalization complete | text, segments[], vtt?, duration?, status |
| summary | LLM summary ready (async, after transcript) | summary |
| error | Any error | message |
| pong | Response to ping | none |
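One way to handle these server messages is to parse each text frame into an enum keyed on the `type` field. The sketch below is illustrative only: `ServerMessage` and `parseServerMessage` are hypothetical names, and only a subset of the listed fields is carried through to keep it short.

```swift
import Foundation

// Hypothetical client-side representation of the message types in the table above.
enum ServerMessage: Equatable {
    case state(status: String)
    case interim(text: String)
    case segment(index: Int, text: String)
    case transcript(text: String)
    case summary(String)
    case error(String)
    case pong
    case unknown(String)  // a type this client does not recognize
}

/// Parses one JSON text frame; returns nil if it is not a JSON object with a string "type".
func parseServerMessage(_ json: String) -> ServerMessage? {
    guard let data = json.data(using: .utf8),
          let obj = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any],
          let type = obj["type"] as? String else { return nil }
    let text = obj["text"] as? String ?? ""
    switch type {
    case "state":      return .state(status: obj["status"] as? String ?? "")
    case "interim":    return .interim(text: text)
    case "segment":    return .segment(index: obj["i"] as? Int ?? 0, text: text)
    case "transcript": return .transcript(text: text)
    case "summary":    return .summary(obj["summary"] as? String ?? "")
    case "error":      return .error(obj["message"] as? String ?? "")
    case "pong":       return .pong
    default:           return .unknown(type)
    }
}
```

Keeping an explicit `unknown` case lets a client log new message types instead of dropping them silently.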
Client Server
| |
|--- POST /api/sessions -------->| Create session
|<-- { id, wsUrl } --------------|
| |
|--- WS connect /session/x/ws ->| Connect
|<-- { type: "state" } ---------| Initial state
| |
|--- binary PCM chunk 1 -------->| Audio
|--- binary PCM chunk 2 -------->|
|<-- { type: "interim" } --------| Real-time partial
|<-- { type: "segment", i:0 } ---| Committed segment
|--- binary PCM chunk 3 -------->|
|<-- { type: "interim" } --------|
|<-- { type: "segment", i:1 } ---|
| |
|--- { type: "stop" } ---------->| Stop recording
|<-- { type: "state", "processing" } Finalizing...
|<-- { type: "transcript" } -----| Whisper final (immediate)
| |
| ... a few seconds later ... |
| |
|<-- { type: "summary" } --------| LLM summary (async)
| |
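The ordering in the flow above (processing state, then transcript, then an asynchronous summary) can be tracked on the client with a small state machine. This is a sketch under assumptions: `SessionProgress` and its phases are hypothetical names, and it only models the message order shown in the diagram.

```swift
/// Hypothetical tracker for where a streaming session is in the flow above.
struct SessionProgress {
    enum Phase: Equatable { case recording, processing, transcribed, complete }
    private(set) var phase: Phase = .recording

    /// Applies a server message type; unrelated or out-of-order messages are ignored.
    mutating func apply(messageType: String, status: String? = nil) {
        switch (messageType, phase) {
        case ("state", .recording) where status == "processing":
            phase = .processing          // server acknowledged the stop
        case ("transcript", .recording), ("transcript", .processing):
            phase = .transcribed         // Whisper final transcript arrived
        case ("summary", .transcribed):
            phase = .complete            // LLM summary arrived after the transcript
        default:
            break
        }
    }
}
```

A UI could, for instance, show the final transcript as soon as the phase reaches `transcribed` and fill in the summary when it reaches `complete`.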
The batch endpoint accepts any common audio format: WAV, MP3, M4A, OGG, FLAC, WEBM. The server handles decoding.
For streaming on iOS, capture with AVAudioEngine and convert to 16kHz mono PCM int16. The tap buffer gives you an AVAudioPCMBuffer; extract .int16ChannelData and send the raw bytes over the WebSocket.
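If the capture path yields float samples rather than int16 (for example, when reading the native tap format directly), the conversion to raw PCM bytes can be done by hand. The sketch below is not what the client on this page does (it uses AVAudioConverter); `int16PCM` is a hypothetical helper, and little-endian byte order is an assumption based on the client sending native int16 memory from Apple hardware.

```swift
import Foundation

/// Converts float samples in [-1, 1] to little-endian int16 PCM bytes.
/// Out-of-range samples are clamped; NaN is treated as silence.
func int16PCM(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        let safe: Float = sample.isNaN ? 0 : sample
        let clamped = max(-1.0, min(1.0, safe))
        let value = Int16((clamped * Float(Int16.max)).rounded())
        // Append the two bytes of the sample, low byte first.
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data
}
```

Scaling by `Int16.max` maps both +1.0 and -1.0 symmetrically (to 32767 and -32767), which avoids overflow at the cost of never producing -32768.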
Minimal Swift integration for an iOS app using native URLSessionWebSocketTask and AVAudioEngine:
import AVFoundation
import Foundation

class WhisperClient {
    private let baseURL = "https://whisper.oxc.dev"
    private let apiToken: String
    private var wsTask: URLSessionWebSocketTask?
    private let engine = AVAudioEngine()
    private let session = URLSession.shared

    init(apiToken: String) {
        self.apiToken = apiToken
    }

    // Callbacks for server messages
    var onInterim: ((String, Int?) -> Void)?                 // (text, speaker)
    var onSegment: ((Int, String, Int?, Double?) -> Void)?   // (index, text, speaker, start)
    var onTranscript: ((TranscriptResult) -> Void)?
    var onSummary: ((String) -> Void)?

    struct SessionInfo: Codable {
        let id: String
        let wsUrl: String
    }

    struct TranscriptResult {
        let text: String
        let segments: [[String: Any]]
        let vtt: String?
        let duration: Double?
    }

    // MARK: - Session lifecycle

    func start() async throws {
        let info = try await createSession()
        connectWebSocket(url: info.wsUrl)
        try startAudioCapture()
    }

    func stop() {
        stopAudioCapture()
        sendJSON(["type": "stop"])
    }

    // MARK: - HTTP

    private func createSession() async throws -> SessionInfo {
        var req = URLRequest(url: URL(string: "\(baseURL)/api/sessions")!)
        req.httpMethod = "POST"
        req.setValue("Bearer \(apiToken)", forHTTPHeaderField: "Authorization")
        let (data, _) = try await session.data(for: req)
        return try JSONDecoder().decode(SessionInfo.self, from: data)
    }

    /// Batch transcribe an audio file. Returns the raw JSON response body.
    func transcribe(fileURL: URL) async throws -> Data {
        var req = URLRequest(url: URL(string: "\(baseURL)/api/transcribe")!)
        req.httpMethod = "POST"
        req.setValue("Bearer \(apiToken)", forHTTPHeaderField: "Authorization")
        let data = try Data(contentsOf: fileURL)
        req.httpBody = data
        // Hard-coded for MP3; set this to match the actual file format.
        req.setValue("audio/mpeg", forHTTPHeaderField: "Content-Type")
        let (responseData, _) = try await session.data(for: req)
        return responseData
    }

    // MARK: - WebSocket

    private func connectWebSocket(url: String) {
        // Pass the token as a query parameter on the WebSocket URL.
        let separator = url.contains("?") ? "&" : "?"
        wsTask = session.webSocketTask(with: URL(string: "\(url)\(separator)token=\(apiToken)")!)
        wsTask?.resume()
        receiveMessage()
    }

    private func receiveMessage() {
        wsTask?.receive { [weak self] result in
            guard let self else { return }
            if case .success(let msg) = result {
                if case .string(let text) = msg {
                    self.handleMessage(text)
                }
                // receive() delivers a single message, so re-arm it after each one.
                // On failure the loop ends and no further messages are read.
                self.receiveMessage()
            }
        }
    }

    private func handleMessage(_ text: String) {
        guard let data = text.data(using: .utf8),
              let json = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
              let type = json["type"] as? String else { return }

        switch type {
        case "interim":
            onInterim?(json["text"] as? String ?? "", json["speaker"] as? Int)
        case "segment":
            onSegment?(
                json["i"] as? Int ?? 0,
                json["text"] as? String ?? "",
                json["speaker"] as? Int,
                json["start"] as? Double
            )
        case "transcript":
            onTranscript?(TranscriptResult(
                text: json["text"] as? String ?? "",
                segments: json["segments"] as? [[String: Any]] ?? [],
                vtt: json["vtt"] as? String,
                duration: json["duration"] as? Double
            ))
        case "summary":
            onSummary?(json["summary"] as? String ?? "")
        default:
            // state, error, and pong messages are ignored in this minimal client.
            break
        }
    }

    private func sendJSON(_ dict: [String: String]) {
        guard let data = try? JSONSerialization.data(withJSONObject: dict),
              let str = String(data: data, encoding: .utf8) else { return }
        wsTask?.send(.string(str)) { _ in }
    }

    private func sendAudio(_ buffer: Data) {
        wsTask?.send(.data(buffer)) { _ in }
    }

    // MARK: - Audio capture

    private func startAudioCapture() throws {
        let inputNode = engine.inputNode
        let nativeFmt = inputNode.outputFormat(forBus: 0)
        let targetFmt = AVAudioFormat(
            commonFormat: .pcmFormatInt16,
            sampleRate: 16000,
            channels: 1,
            interleaved: true
        )!
        guard let converter = AVAudioConverter(from: nativeFmt, to: targetFmt) else {
            throw NSError(domain: "WhisperClient", code: 1,
                          userInfo: [NSLocalizedDescriptionKey: "Cannot create audio converter"])
        }

        // Audio still in the accumulator (under 5 s) when capture stops is not sent.
        var pcmAccumulator = Data()
        let chunkBytes = 16000 * 2 * 5 // 5 seconds of int16 mono

        inputNode.installTap(onBus: 0, bufferSize: 4096, format: nativeFmt) { [weak self] buffer, _ in
            guard let self else { return }
            let ratio = 16000.0 / nativeFmt.sampleRate
            // Round up so the output buffer can hold every converted frame.
            let capacity = AVAudioFrameCount((Double(buffer.frameLength) * ratio).rounded(.up))
            guard let outBuf = AVAudioPCMBuffer(pcmFormat: targetFmt, frameCapacity: capacity) else { return }

            var error: NSError?
            var consumed = false
            converter.convert(to: outBuf, error: &error) { _, outStatus in
                // Hand the tap buffer over once; the converter may call this block again.
                if consumed {
                    outStatus.pointee = .noDataNow
                    return nil
                }
                consumed = true
                outStatus.pointee = .haveData
                return buffer
            }
            guard error == nil else { return }

            if let channelData = outBuf.int16ChannelData {
                let ptr = UnsafeBufferPointer(start: channelData[0], count: Int(outBuf.frameLength))
                pcmAccumulator.append(ptr) // appends the raw int16 bytes
            }

            while pcmAccumulator.count >= chunkBytes {
                let chunk = Data(pcmAccumulator.prefix(chunkBytes))
                // Copy the tail into a fresh Data so its indices start at zero again.
                pcmAccumulator = Data(pcmAccumulator.dropFirst(chunkBytes))
                self.sendAudio(chunk)
            }
        }
        try engine.start()
    }

    private func stopAudioCapture() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
let client = WhisperClient(apiToken: "your-api-token-here")

client.onInterim = { text, speaker in
    print("[\(speaker.map { "Speaker \($0)" } ?? "")]", text)
}
client.onSegment = { i, text, speaker, start in
    print("Segment \(i): \(text)")
}
client.onTranscript = { result in
    print("Final: \(result.text)")
}
client.onSummary = { summary in
    print("Summary: \(summary)")
}

// Real-time streaming
try await client.start()
// ... user records ...
client.stop()

// Or batch transcribe a file
let data = try await client.transcribe(fileURL: audioFileURL)
A native iOS client can use AVAudioEngine for low-latency capture, background audio modes (UIBackgroundModes: audio), Siri Shortcuts integration, and on-device fallback via Apple's Speech framework. The WebSocket client (URLSessionWebSocketTask) is built into Foundation, so no third-party dependency is needed.