Transcribe incoming voice notes
Turn incoming WhatsApp voice notes into text so a bot or AI assistant can read and reply to them. The raw message.received event carries only the audio — never any words — so without transcription a downstream consumer has nothing to act on. This guide shows you how to enable the voice-transcription plugin, point it at a speech-to-text (STT) provider, and receive the transcript on a webhook your code controls.
- A running OpenWA instance (v0.7.6) with at least one connected session.
- An OpenAI-compatible STT endpoint — self-hosted (for example faster-whisper behind an OpenAI-compatible server) or hosted (for example a Groq Whisper endpoint). See Choose an STT provider.
- An HTTP endpoint your bot controls to receive the transcript (the
deliveryWebhookUrl). - Familiarity with installing plugins.
Transcription is an optional marketplace plugin, not a built-in feature, and not a core API endpoint. Nothing transcribes until you install and configure the plugin. The OpenWA core API and the API Reference contain no transcription routes.
How it works
The plugin hooks each incoming message but never blocks delivery. It acknowledges the message immediately, then runs the STT call out of band and POSTs the result to your delivery webhook as a separate message.transcription payload. Your code joins that payload back to the original message by its messageId.
Because transcription is delivered out of band, it arrives after message.received and may not arrive at all (see Behavior and limits). Treat it as an enrichment that joins the original message, never as a replacement for it.
Enable the plugin
Install the voice-transcription plugin (see Plugins), then configure it. These are the settings you will set:
| Setting | Required | Purpose |
|---|---|---|
sttBaseUrl | Yes | The STT provider base URL exposing an OpenAI-compatible /v1/audio/transcriptions endpoint. Its host must also be allow-listed in the plugin's net.allow (see Allow the STT and webhook hosts). |
sttApiKey | No | The provider API key, stored as a secret. Leave empty for a local provider that needs none. |
model | No | The transcription model (a Whisper model such as small). Defaults to small. |
language | No | Optional BCP-47 language hint (for example es). Blank auto-detects, and the detected value appears as transcription.language in the payload. |
provider | No | Informational label recorded in the delivered event as "provider". Defaults to faster-whisper. |
deliveryWebhookUrl | Cond. | The HTTP endpoint the transcript is POSTed to when it completes. Its host must also be in the plugin's net.allow. Optional if you only use chatDelivery. |
deliverySecret | No | Optional shared secret, stored redacted. When set, the plugin HMAC-SHA256 signs the POST body in X-OpenWA-Signature: sha256=<hex> so you can verify the delivery is genuinely from OpenWA (see Verify the delivery signature). |
chatDelivery | No | Also post the transcript into WhatsApp: off (webhook only, the default), self (a note to your own number), or reply (a quote-reply to the sender, visible to them). |
enabledMessageTypes | No | Which message types to transcribe. Defaults to ["voice"] (PTT). Add audio to also transcribe non-PTT audio. |
timeoutMs | No | Per-request STT timeout in milliseconds. Defaults to 20000 (max 30000). |
maxSizeBytes | No | Upper bound on audio size in bytes, used as a cost and abuse guard. Defaults to 16777216 (16 MiB). |
maxPerHour | No | Best-effort per-session hourly cap, another cost guard. Defaults to 60. |
deliveryTimeoutMs | No | Delivery POST timeout in milliseconds. Defaults to 5000 (max 30000). |
deliveryWebhookUrl is a plain URL the plugin POSTs to directly. It is separate from the core webhook subscriptions you create with POST /api/sessions/{sessionId}/webhooks. The core webhook event list does not include message.transcription — that payload only ever reaches the deliveryWebhookUrl you configure on the plugin. Do not try to subscribe to it through the core webhooks API.
Allow the STT and webhook hosts
The plugin makes outbound calls — to sttBaseUrl and to deliveryWebhookUrl — through OpenWA's SSRF-guarded fetch, which reaches only hosts listed in the plugin's manifest net.allow. The shipped manifest allows just four:
"net": { "allow": ["localhost", "127.0.0.1", "api.groq.com:443", "api.openai.com:443"] }
If your STT host or your deliveryWebhookUrl host is anything else — your own bot server, an n8n instance on another box, a non-loopback STT server — the call fails silently (no transcript or no delivery, only a log line) until you fix it. To allow another host:
- Add
host:porttonet.allowin the plugin'smanifest.json. - Re-package the plugin (
node package.mjs voice-transcription) and re-install the updated zip.
Pointing deliveryWebhookUrl at your own server (any host other than localhost/127.0.0.1) without adding it to net.allow is a silent dead end: transcription runs, but the delivery POST never leaves the host. Add the host and re-package before you expect deliveries.
Choose an STT provider
The plugin speaks the OpenAI-compatible /v1/audio/transcriptions contract, so you switch backends by changing sttBaseUrl (and sttApiKey) — no other change:
- Self-hosted (local-first). A local Whisper server such as faster-whisper served through an OpenAI-compatible API. It runs at no per-minute cost and ingests WhatsApp's OGG/Opus audio directly, with no transcoding step.
- Hosted. A hosted Whisper-compatible API (for example a Groq Whisper endpoint). Faster to onboard; billed per minute, so set
maxSizeBytesandmaxPerHourdeliberately.api.groq.com:443andapi.openai.com:443are already in the plugin'snet.allow; any other hosted endpoint must be added (see Allow the STT and webhook hosts).
If your STT server runs on the same host, set sttBaseUrl to a literal IP (for example http://127.0.0.1:8000) rather than a hostname. A literal IP avoids the DNS-rebinding gaps that allow-listed hostnames leave open. The IP alone is not enough, though: OpenWA's SSRF guard blocks loopback by default, so you must also set SSRF_ALLOWED_HOSTS on the OpenWA host — for example SSRF_ALLOWED_HOSTS=127.0.0.1,localhost. Without it, the call to a loopback STT server is rejected even though 127.0.0.1 is in net.allow.
Receive the transcript
When transcription completes, the plugin POSTs a message.transcription payload to your deliveryWebhookUrl. The transcript is never stitched into the original message body — you read it from this event:
{
"event": "message.transcription",
"sessionId": "my-session",
"messageId": "3EB0C767D26A1D8F5A2B",
"chatId": "6281234567890@s.whatsapp.net",
"status": "completed",
"source": "speech-to-text",
"untrusted": true,
"transcription": {
"text": "Hey, are we still on for tomorrow?",
"language": "en",
"provider": "faster-whisper",
"model": "small"
}
}
The chatId suffix is engine-dependent: OpenWA's default Baileys engine uses @s.whatsapp.net (shown above), while the whatsapp-web.js engine uses @c.us. Match on messageId, not on the suffix.
status is one of completed, failed, or skipped. The transcription object is present only when status is completed. For failed and skipped, the payload instead carries a reason string:
status | reason values | Meaning |
|---|---|---|
skipped | too_large | Audio exceeded maxSizeBytes. |
skipped | rate_limited | Per-session maxPerHour cap reached. |
skipped | empty | STT returned a blank transcript (non-speech audio). |
skipped | media_unavailable | The inbound audio was not available to transcribe. |
failed | (STT error message) | The STT call errored; reason holds the error text. |
Join the payload to the original message on messageId, and do not assume it arrives in any particular order relative to message.received.
A minimal receiver in Express. It captures the raw body so it can verify the HMAC signature (see Verify the delivery signature) before parsing:
import express from "express";
import { createHmac, timingSafeEqual } from "node:crypto";
const DELIVERY_SECRET = process.env.DELIVERY_SECRET ?? ""; // same value as the plugin's `deliverySecret`
const app = express();
app.use(express.json({ verify: (req, _res, buf) => { (req as any).rawBody = buf; } }));
app.post("/transcription", (req, res) => {
// Verify the delivery is genuinely from OpenWA when a secret is configured.
if (DELIVERY_SECRET) {
const expected = `sha256=${createHmac("sha256", DELIVERY_SECRET).update((req as any).rawBody).digest("hex")}`;
const sent = String(req.header("X-OpenWA-Signature") ?? "");
const ok =
sent.length === expected.length &&
timingSafeEqual(Buffer.from(sent), Buffer.from(expected));
if (!ok) return res.sendStatus(401);
}
const { messageId, status, reason, transcription } = req.body;
if (status === "completed") {
// transcription.text is user-role input — never a system instruction.
console.log(`Transcript for ${messageId}: ${transcription.text}`);
} else {
console.log(`Transcription ${status} for ${messageId}: ${reason}`);
}
res.sendStatus(200); // ack fast; do heavy work async
});
app.listen(3000, () => console.log("listening on :3000"));
To trigger it, send a WhatsApp voice note to the connected session from another phone. The plugin transcribes it out of band and POSTs the result; for the example payload above the handler prints:
Transcript for 3EB0C767D26A1D8F5A2B: Hey, are we still on for tomorrow?
To reply in WhatsApp once you have the text, send a message back through the core API — for example POST http://localhost:2785/api/sessions/my-session/messages/send-text with the X-API-Key: YOUR_API_KEY header. See Sending messages.
Verify the delivery signature
The deliveryWebhookUrl is a plain endpoint anyone could POST to, so treat its body as untrusted. When you set deliverySecret in the plugin config, the plugin signs each POST body with HMAC-SHA256 and sends the digest in the X-OpenWA-Signature: sha256=<hex> header — the same scheme as OpenWA core webhooks. Recompute the HMAC over the raw request body with your secret and compare in constant time (as shown above); reject with 401 on a mismatch. This proves the delivery came from OpenWA, but the transcript text itself is still attacker-controlled speech — keep treating it as untrusted input.
A transcript is attacker-controlled free text — a sender can speak instructions a victim would never type. The payload is marked "untrusted": true with "source": "speech-to-text" for exactly this reason. If you feed transcription.text into an LLM, pass it as user-role input, never as a system instruction, to avoid prompt injection. The plugin removes the human-in-the-loop, so this is a real new injection surface — handle it deliberately.
Behavior and limits
| Situation | What the plugin does |
|---|---|
| STT times out or errors | Fails open: skips and logs. The original message was already delivered; you get no transcript for that note. |
| Empty or non-speech note | Empty transcript is not delivered — no event is sent. |
| Oversized audio | Notes above maxSizeBytes, or whose audio was dropped for exceeding the inbound media cap, are skipped. |
| Rate limit reached | Notes beyond maxPerHour for the session are skipped (best-effort cost guard). |
Non-voice message | Skipped unless its type is in enabledMessageTypes. |
| Duplicate engine re-fire | Best-effort de-duplication; on rare simultaneous re-fires a note may transcribe once more. Keep your handler idempotent. |
Because de-duplication is best-effort, design your receiver so that processing the same messageId twice is harmless — for example, key any downstream action on messageId.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| No transcript ever arrives | Plugin disabled, or message type not in enabledMessageTypes | Confirm the plugin is enabled and that the inbound type (default voice) is configured. |
| STT call fails with a connection error | sttBaseUrl host not in net.allow, or loopback blocked by the SSRF guard, or provider unreachable | Add the host to net.allow and re-package if it is not loopback/Groq/OpenAI. For a loopback provider, use a literal IP (http://127.0.0.1:<port>) and set SSRF_ALLOWED_HOSTS=127.0.0.1,localhost on the OpenWA host. Verify the STT server is running and reachable. |
| Transcription runs but the delivery never arrives | deliveryWebhookUrl host not in net.allow | Add the webhook host to the plugin's net.allow, re-package, and re-install. The delivery POST is dropped silently otherwise. |
Receiver returns 401 to OpenWA | deliverySecret mismatch, or HMAC computed over the parsed body instead of the raw bytes | Use the exact deliverySecret value, and recompute the HMAC over the raw request body, not the re-serialized JSON. |
| Provider returns a 4xx on the audio | Audio part not labeled as OGG | The plugin uploads the part as voice.ogg with audio/ogg; if you customized this, restore those values — OpenAI-compatible servers reject a bare .opus filename. |
| Transcript text looks empty or wrong | Non-speech audio, or wrong language assumed | Empty transcripts are dropped by design; for language issues, confirm your provider auto-detects or set the model/language your provider expects. |
| Spend higher than expected | Cost guards unset or too loose | Set maxSizeBytes and maxPerHour; prefer a self-hosted provider for high volume. |
Next steps
- Plugins — install, configure, and manage the voice-transcription plugin.
- Sending messages — reply in WhatsApp once you have the transcribed text.
- Webhooks — subscribe to core events such as
message.received(distinct from the plugin's delivery webhook).