Version: v0.7.6

Transcribe incoming voice notes

Turn incoming WhatsApp voice notes into text so a bot or AI assistant can read and reply to them. The raw message.received event carries only the audio — never any words — so without transcription a downstream consumer has nothing to act on. This guide shows you how to enable the voice-transcription plugin, point it at a speech-to-text (STT) provider, and receive the transcript on a webhook your code controls.

Prerequisites
  • A running OpenWA instance (v0.7.6) with at least one connected session.
  • An OpenAI-compatible STT endpoint — self-hosted (for example faster-whisper behind an OpenAI-compatible server) or hosted (for example a Groq Whisper endpoint). See Choose an STT provider.
  • An HTTP endpoint your bot controls to receive the transcript (the deliveryWebhookUrl).
  • Familiarity with installing plugins.

Optional and off by default

Transcription is an optional marketplace plugin, not a built-in feature, and not a core API endpoint. Nothing transcribes until you install and configure the plugin. The OpenWA core API and the API Reference contain no transcription routes.

How it works

The plugin hooks each incoming message but never blocks delivery. It acknowledges the message immediately, then runs the STT call out of band and POSTs the result to your delivery webhook as a separate message.transcription payload. Your code joins that payload back to the original message by its messageId.

Because transcription is delivered out of band, it arrives after message.received and may not arrive at all (see Behavior and limits). Treat it as an enrichment that joins the original message, never as a replacement for it.
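One way to handle the out-of-band arrival is to keep a small record per messageId that either event can create. The sketch below is illustrative only: the `Joined` shape, the `pending` map, and both handler names are hypothetical and not part of OpenWA.

```typescript
// Hypothetical in-memory join of message.received and message.transcription.
interface Transcript {
  text: string;
  language?: string;
}

interface Joined {
  messageId: string;
  received: boolean;        // true once message.received has been seen
  transcript?: Transcript;  // set once a completed transcription arrives
}

const pending = new Map<string, Joined>();

// Either event may arrive first, so both handlers create the record on demand.
function entryFor(messageId: string): Joined {
  let entry = pending.get(messageId);
  if (!entry) {
    entry = { messageId, received: false };
    pending.set(messageId, entry);
  }
  return entry;
}

// Call from your core message.received webhook handler.
function onMessageReceived(messageId: string): Joined {
  const entry = entryFor(messageId);
  entry.received = true;
  return entry;
}

// Call from your deliveryWebhookUrl handler when status is "completed".
function onTranscription(messageId: string, transcript: Transcript): Joined {
  const entry = entryFor(messageId);
  entry.transcript = transcript;
  return entry;
}
```

Because a transcript may never arrive, a real receiver would also evict old records rather than let the map grow without bound.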

Enable the plugin

Install the voice-transcription plugin (see Plugins), then configure it. These are the settings you will set:

| Setting | Required | Purpose |
| --- | --- | --- |
| sttBaseUrl | Yes | The STT provider base URL exposing an OpenAI-compatible /v1/audio/transcriptions endpoint. Its host must also be allow-listed in the plugin's net.allow (see Allow the STT and webhook hosts). |
| sttApiKey | No | The provider API key, stored as a secret. Leave empty for a local provider that needs none. |
| model | No | The transcription model (a Whisper model such as small). Defaults to small. |
| language | No | Optional BCP-47 language hint (for example es). Blank auto-detects, and the detected value appears as transcription.language in the payload. |
| provider | No | Informational label recorded in the delivered event as "provider". Defaults to faster-whisper. |
| deliveryWebhookUrl | Conditional | The HTTP endpoint the transcript is POSTed to when it completes. Its host must also be in the plugin's net.allow. Optional if you only use chatDelivery. |
| deliverySecret | No | Optional shared secret, stored redacted. When set, the plugin HMAC-SHA256 signs the POST body in X-OpenWA-Signature: sha256=<hex> so you can verify the delivery is genuinely from OpenWA (see Verify the delivery signature). |
| chatDelivery | No | Also post the transcript into WhatsApp: off (webhook only, the default), self (a note to your own number), or reply (a quote-reply to the sender, visible to them). |
| enabledMessageTypes | No | Which message types to transcribe. Defaults to ["voice"] (PTT). Add audio to also transcribe non-PTT audio. |
| timeoutMs | No | Per-request STT timeout in milliseconds. Defaults to 20000 (max 30000). |
| maxSizeBytes | No | Upper bound on audio size in bytes, used as a cost and abuse guard. Defaults to 16777216 (16 MiB). |
| maxPerHour | No | Best-effort per-session hourly cap, another cost guard. Defaults to 60. |
| deliveryTimeoutMs | No | Delivery POST timeout in milliseconds. Defaults to 5000 (max 30000). |

The delivery webhook is not a core webhook

deliveryWebhookUrl is a plain URL the plugin POSTs to directly. It is separate from the core webhook subscriptions you create with POST /api/sessions/{sessionId}/webhooks. The core webhook event list does not include message.transcription — that payload only ever reaches the deliveryWebhookUrl you configure on the plugin. Do not try to subscribe to it through the core webhooks API.
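Before saving the settings, it can help to sanity-check them against the limits in the table above. The sketch below is a hypothetical pre-flight helper, not an OpenWA API: the `TranscriptionConfig` shape and `checkConfig` function are assumptions that mirror a subset of the documented settings, and how you actually enter the values depends on the plugin configuration screen.

```typescript
// Hypothetical shape mirroring part of the settings table; not an OpenWA type.
interface TranscriptionConfig {
  sttBaseUrl: string;
  deliveryWebhookUrl?: string;
  chatDelivery?: "off" | "self" | "reply";
  timeoutMs?: number;
  deliveryTimeoutMs?: number;
}

function isHttpUrl(value: string): boolean {
  try {
    const u = new URL(value);
    return u.protocol === "http:" || u.protocol === "https:";
  } catch {
    return false;
  }
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function checkConfig(cfg: TranscriptionConfig): string[] {
  const problems: string[] = [];
  if (!isHttpUrl(cfg.sttBaseUrl)) problems.push("sttBaseUrl must be an http(s) URL");

  // deliveryWebhookUrl is only optional when chatDelivery is self or reply.
  const chatDelivery = cfg.chatDelivery ?? "off";
  if (cfg.deliveryWebhookUrl === undefined) {
    if (chatDelivery === "off") problems.push("set deliveryWebhookUrl or enable chatDelivery");
  } else if (!isHttpUrl(cfg.deliveryWebhookUrl)) {
    problems.push("deliveryWebhookUrl must be an http(s) URL");
  }

  // Documented defaults are 20000 and 5000; both settings cap at 30000.
  if ((cfg.timeoutMs ?? 20000) > 30000) problems.push("timeoutMs exceeds the 30000 maximum");
  if ((cfg.deliveryTimeoutMs ?? 5000) > 30000) problems.push("deliveryTimeoutMs exceeds the 30000 maximum");
  return problems;
}
```

The helper only checks values; it cannot tell whether the hosts are reachable or allow-listed.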

Allow the STT and webhook hosts

The plugin makes outbound calls — to sttBaseUrl and to deliveryWebhookUrl — through OpenWA's SSRF-guarded fetch, which reaches only hosts listed in the plugin's manifest net.allow. The shipped manifest allows just four:

"net": { "allow": ["localhost", "127.0.0.1", "api.groq.com:443", "api.openai.com:443"] }

If your STT host or your deliveryWebhookUrl host is anything else — your own bot server, an n8n instance on another box, a non-loopback STT server — the call fails silently (no transcript or no delivery, only a log line) until you fix it. To allow another host:

  1. Add host:port to net.allow in the plugin's manifest.json.
  2. Re-package the plugin (node package.mjs voice-transcription) and re-install the updated zip.

A delivery webhook on your own server needs net.allow first

Pointing deliveryWebhookUrl at your own server (any host other than localhost/127.0.0.1) without adding it to net.allow is a silent dead end: transcription runs, but the delivery POST never leaves the host. Add the host and re-package before you expect deliveries.
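Since a missing net.allow entry fails silently, a rough pre-flight check on your side can catch it early. The sketch below is a guess at the matching rules, not OpenWA's actual guard: it assumes an entry without a port matches any port and an entry with a port must match the URL's explicit or scheme-default port. The `isProbablyAllowed` helper is hypothetical.

```typescript
// The allow list shipped in the plugin manifest, as quoted above.
const SHIPPED_ALLOW = ["localhost", "127.0.0.1", "api.groq.com:443", "api.openai.com:443"];

// URL.port is empty when the port is the scheme default, so fill it in.
function effectivePort(u: URL): string {
  if (u.port !== "") return u.port;
  return u.protocol === "https:" ? "443" : "80";
}

// ASSUMPTION: "host" matches any port, "host:port" matches that port only.
function isProbablyAllowed(target: string, allow: string[] = SHIPPED_ALLOW): boolean {
  const u = new URL(target);
  return allow.some((entry) => {
    const idx = entry.lastIndexOf(":");
    const host = idx === -1 ? entry : entry.slice(0, idx);
    const port = idx === -1 ? null : entry.slice(idx + 1);
    return host === u.hostname && (port === null || port === effectivePort(u));
  });
}
```

For example, `isProbablyAllowed("https://bot.example.com/transcription")` returns false against the shipped list, which signals that the host needs adding and the plugin re-packaging. The OpenWA log line remains the authoritative answer.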

Choose an STT provider

The plugin speaks the OpenAI-compatible /v1/audio/transcriptions contract, so you switch backends by changing sttBaseUrl (and sttApiKey) — no other change:

  • Self-hosted (local-first). A local Whisper server such as faster-whisper served through an OpenAI-compatible API. It runs at no per-minute cost and ingests WhatsApp's OGG/Opus audio directly, with no transcoding step.
  • Hosted. A hosted Whisper-compatible API (for example a Groq Whisper endpoint). Faster to onboard; billed per minute, so set maxSizeBytes and maxPerHour deliberately. api.groq.com:443 and api.openai.com:443 are already in the plugin's net.allow; any other hosted endpoint must be added (see Allow the STT and webhook hosts).

Target a local provider by literal IP, and allow loopback

If your STT server runs on the same host, set sttBaseUrl to a literal IP (for example http://127.0.0.1:8000) rather than a hostname. A literal IP avoids the DNS-rebinding gaps that allow-listed hostnames leave open. The IP alone is not enough, though: OpenWA's SSRF guard blocks loopback by default, so you must also set SSRF_ALLOWED_HOSTS on the OpenWA host — for example SSRF_ALLOWED_HOSTS=127.0.0.1,localhost. Without it, the call to a loopback STT server is rejected even though 127.0.0.1 is in net.allow.
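To confirm a provider really speaks the OpenAI-compatible contract before pointing the plugin at it, you could send it one sample file yourself. The sketch below is a stand-alone smoke test under stated assumptions, not the plugin's own code: it relies on the multipart `file` and `model` fields and the `{ text }` JSON response of the OpenAI transcription API, and on Node 18+ globals (`fetch`, `FormData`, `Blob`). The `SttTarget` shape and both function names are hypothetical.

```typescript
// Hypothetical settings subset used for the smoke test.
interface SttTarget {
  sttBaseUrl: string;
  sttApiKey?: string;
  model?: string;
  language?: string;
}

interface SttRequest {
  url: string;
  headers: Record<string, string>;
  body: FormData;
}

// Builds the request without sending it, so it can be inspected first.
function buildTranscriptionRequest(target: SttTarget, audio: Blob): SttRequest {
  const body = new FormData();
  // Label the part as OGG; see the troubleshooting table about bare .opus names.
  body.append("file", audio, "voice.ogg");
  body.append("model", target.model ?? "small");
  if (target.language) body.append("language", target.language);

  const headers: Record<string, string> = {};
  if (target.sttApiKey) headers["Authorization"] = `Bearer ${target.sttApiKey}`;

  const base = target.sttBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return { url: `${base}/v1/audio/transcriptions`, headers, body };
}

// Sends the request and returns the transcript text (empty string if absent).
async function transcribeOnce(target: SttTarget, audio: Blob): Promise<string> {
  const { url, headers, body } = buildTranscriptionRequest(target, audio);
  const res = await fetch(url, { method: "POST", headers, body });
  if (!res.ok) throw new Error(`STT returned HTTP ${res.status}`);
  const data = (await res.json()) as { text?: string };
  return data.text ?? "";
}
```

This call goes out from your own machine, so it bypasses OpenWA's SSRF guard and net.allow. A passing smoke test shows the provider works, not that the plugin can reach it.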

Receive the transcript

When transcription completes, the plugin POSTs a message.transcription payload to your deliveryWebhookUrl. The transcript is never stitched into the original message body — you read it from this event:

{
  "event": "message.transcription",
  "sessionId": "my-session",
  "messageId": "3EB0C767D26A1D8F5A2B",
  "chatId": "6281234567890@s.whatsapp.net",
  "status": "completed",
  "source": "speech-to-text",
  "untrusted": true,
  "transcription": {
    "text": "Hey, are we still on for tomorrow?",
    "language": "en",
    "provider": "faster-whisper",
    "model": "small"
  }
}

The chatId suffix is engine-dependent: OpenWA's default Baileys engine uses @s.whatsapp.net (shown above), while the whatsapp-web.js engine uses @c.us. Match on messageId, not on the suffix.

status is one of completed, failed, or skipped. The transcription object is present only when status is completed. For failed and skipped, the payload instead carries a reason string:

| status | reason values | Meaning |
| --- | --- | --- |
| skipped | too_large | Audio exceeded maxSizeBytes. |
| skipped | rate_limited | Per-session maxPerHour cap reached. |
| skipped | empty | STT returned a blank transcript (non-speech audio). |
| skipped | media_unavailable | The inbound audio was not available to transcribe. |
| failed | (STT error message) | The STT call errored; reason holds the error text. |

Join the payload to the original message on messageId, and do not assume it arrives in any particular order relative to message.received.
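The status rules above map naturally onto a discriminated union, which lets the compiler stop you from reading `transcription` on a failed or skipped event. The types below are inferred from the example payload and the reason table, not taken from a published OpenWA schema. Treating `source` and `untrusted` as optional is an assumption, since the example only shows them on a completed event.

```typescript
// Fields assumed to be common to every message.transcription payload.
interface TranscriptionEventBase {
  event: "message.transcription";
  sessionId: string;
  messageId: string;
  chatId: string;
  source?: string;     // "speech-to-text" in the example payload
  untrusted?: boolean; // true in the example payload
}

interface CompletedEvent extends TranscriptionEventBase {
  status: "completed";
  transcription: { text: string; language: string; provider: string; model: string };
}

interface NotCompletedEvent extends TranscriptionEventBase {
  status: "failed" | "skipped";
  reason: string; // too_large, rate_limited, empty, media_unavailable, or the STT error text
}

type TranscriptionEvent = CompletedEvent | NotCompletedEvent;

// Narrowing on status gives access to exactly the fields that status carries.
function summarize(ev: TranscriptionEvent): string {
  switch (ev.status) {
    case "completed":
      return `transcript: ${ev.transcription.text}`;
    case "skipped":
      return `skipped (${ev.reason})`;
    case "failed":
      return `failed: ${ev.reason}`;
  }
}
```

Note that these types describe the payload only after you have parsed and validated it; `req.body` itself is still unchecked input at runtime.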

A minimal receiver in Express. It captures the raw body so it can verify the HMAC signature (see Verify the delivery signature) before parsing:

import express from "express";
import { createHmac, timingSafeEqual } from "node:crypto";

const DELIVERY_SECRET = process.env.DELIVERY_SECRET ?? ""; // same value as the plugin's `deliverySecret`

const app = express();
// Keep the raw bytes: the HMAC must be computed over the body exactly as it was sent.
app.use(express.json({ verify: (req, _res, buf) => { (req as any).rawBody = buf; } }));

app.post("/transcription", (req, res) => {
  // Verify the delivery is genuinely from OpenWA when a secret is configured.
  if (DELIVERY_SECRET) {
    // rawBody is unset when the request carried no JSON body; fall back to empty bytes.
    const rawBody: Buffer = (req as any).rawBody ?? Buffer.alloc(0);
    const expected = Buffer.from(`sha256=${createHmac("sha256", DELIVERY_SECRET).update(rawBody).digest("hex")}`);
    const sent = Buffer.from(String(req.header("X-OpenWA-Signature") ?? ""));
    // Compare byte lengths first: timingSafeEqual throws when they differ.
    const ok = sent.length === expected.length && timingSafeEqual(sent, expected);
    if (!ok) {
      res.sendStatus(401);
      return;
    }
  }

  const { messageId, status, reason, transcription } = req.body ?? {};
  if (status === "completed") {
    // transcription.text is user-role input, never a system instruction.
    console.log(`Transcript for ${messageId}: ${transcription?.text}`);
  } else {
    console.log(`Transcription ${status} for ${messageId}: ${reason}`);
  }
  res.sendStatus(200); // ack fast; do heavy work async
});

app.listen(3000, () => console.log("listening on :3000"));

To trigger it, send a WhatsApp voice note to the connected session from another phone. The plugin transcribes it out of band and POSTs the result; for the example payload above the handler prints:

Transcript for 3EB0C767D26A1D8F5A2B: Hey, are we still on for tomorrow?

To reply in WhatsApp once you have the text, send a message back through the core API — for example POST http://localhost:2785/api/sessions/my-session/messages/send-text with the X-API-Key: YOUR_API_KEY header. See Sending messages.

Verify the delivery signature

The deliveryWebhookUrl is a plain endpoint anyone could POST to, so treat its body as untrusted. When you set deliverySecret in the plugin config, the plugin signs each POST body with HMAC-SHA256 and sends the digest in the X-OpenWA-Signature: sha256=<hex> header — the same scheme as OpenWA core webhooks. Recompute the HMAC over the raw request body with your secret and compare in constant time (as shown above); reject with 401 on a mismatch. This proves the delivery came from OpenWA, but the transcript text itself is still attacker-controlled speech — keep treating it as untrusted input.
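To exercise your receiver without waiting for a real voice note, you can sign a sample body the same way and POST it yourself. The sketch below assumes only what this section states (HMAC-SHA256 over the raw body, hex digest, `sha256=` prefix). The `signBody` and `verifySignature` helpers are hypothetical names, not OpenWA exports.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Produces the header value in the documented sha256=<hex> form.
function signBody(rawBody: string | Buffer, secret: string): string {
  return `sha256=${createHmac("sha256", secret).update(rawBody).digest("hex")}`;
}

// Constant-time check of a received header against the raw request body.
function verifySignature(rawBody: string | Buffer, secret: string, header: string | undefined): boolean {
  if (!header) return false;
  const expected = Buffer.from(signBody(rawBody, secret));
  const sent = Buffer.from(header);
  // timingSafeEqual throws on unequal lengths, so compare byte lengths first.
  return sent.length === expected.length && timingSafeEqual(sent, expected);
}
```

For a local test, serialize a sample payload once, pass that exact string to `signBody`, and send the same string as the request body with the result in X-OpenWA-Signature. Re-serializing the JSON between signing and sending can change the bytes and break the match.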

Treat transcribed text as untrusted input

A transcript is attacker-controlled free text — a sender can speak instructions a victim would never type. The payload is marked "untrusted": true with "source": "speech-to-text" for exactly this reason. If you feed transcription.text into an LLM, pass it as user-role input, never as a system instruction, to avoid prompt injection. The plugin removes the human-in-the-loop, so this is a real new injection surface — handle it deliberately.
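As a concrete illustration of the user-role rule, the sketch below builds a chat-style message list in which the transcript only ever appears as user content. The `ChatMessage` shape follows the common role/content convention used by many LLM APIs, and the length cap is an arbitrary illustrative value. Neither comes from OpenWA.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical cap so a very long voice note cannot flood the prompt.
const MAX_TRANSCRIPT_CHARS = 4000;

// The system prompt is fixed by you; the transcript is never concatenated into it.
function buildMessages(systemPrompt: string, transcriptText: string): ChatMessage[] {
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: transcriptText.slice(0, MAX_TRANSCRIPT_CHARS) },
  ];
}
```

Keeping the transcript out of the system message does not make it safe by itself. Any tool call or action the model proposes in response should still be checked against what the sender is allowed to do.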

Behavior and limits

| Situation | What the plugin does |
| --- | --- |
| STT times out or errors | Fails open: skips and logs. The original message was already delivered; you get no transcript for that note. |
| Empty or non-speech note | Empty transcript is not delivered; no event is sent. |
| Oversized audio | Notes above maxSizeBytes, or whose audio was dropped for exceeding the inbound media cap, are skipped. |
| Rate limit reached | Notes beyond maxPerHour for the session are skipped (best-effort cost guard). |
| Non-voice message | Skipped unless its type is in enabledMessageTypes. |
| Duplicate engine re-fire | Best-effort de-duplication; on rare simultaneous re-fires a note may transcribe once more. Keep your handler idempotent. |

Keep your handler idempotent

Because de-duplication is best-effort, design your receiver so that processing the same messageId twice is harmless — for example, key any downstream action on messageId.
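A minimal way to key on messageId is to remember which IDs you have already acted on and skip repeats. The sketch below is an in-memory illustration only: the `firstTime` helper and the one-hour retention window are hypothetical choices, and a receiver running as several processes would need a shared store instead.

```typescript
// messageId -> time first seen, in milliseconds.
const seenMessages = new Map<string, number>();

// Hypothetical retention window; duplicates arriving later than this are not caught.
const SEEN_TTL_MS = 60 * 60 * 1000;

// Returns true the first time a messageId is seen, false for a repeat within the window.
function firstTime(messageId: string, now: number = Date.now()): boolean {
  // Drop expired entries so the map does not grow forever.
  seenMessages.forEach((seenAt, id) => {
    if (now - seenAt > SEEN_TTL_MS) seenMessages.delete(id);
  });
  if (seenMessages.has(messageId)) return false;
  seenMessages.set(messageId, now);
  return true;
}
```

In the receiver, call `firstTime(messageId)` before triggering any reply or side effect, and still answer the duplicate delivery with 200 so it is acknowledged.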

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| No transcript ever arrives | Plugin disabled, or message type not in enabledMessageTypes | Confirm the plugin is enabled and that the inbound type (default voice) is configured. |
| STT call fails with a connection error | sttBaseUrl host not in net.allow, or loopback blocked by the SSRF guard, or provider unreachable | Add the host to net.allow and re-package if it is not loopback/Groq/OpenAI. For a loopback provider, use a literal IP (http://127.0.0.1:<port>) and set SSRF_ALLOWED_HOSTS=127.0.0.1,localhost on the OpenWA host. Verify the STT server is running and reachable. |
| Transcription runs but the delivery never arrives | deliveryWebhookUrl host not in net.allow | Add the webhook host to the plugin's net.allow, re-package, and re-install. The delivery POST is dropped silently otherwise. |
| Receiver returns 401 to OpenWA | deliverySecret mismatch, or HMAC computed over the parsed body instead of the raw bytes | Use the exact deliverySecret value, and recompute the HMAC over the raw request body, not the re-serialized JSON. |
| Provider returns a 4xx on the audio | Audio part not labeled as OGG | The plugin uploads the part as voice.ogg with audio/ogg; if you customized this, restore those values, because OpenAI-compatible servers reject a bare .opus filename. |
| Transcript text looks empty or wrong | Non-speech audio, or wrong language assumed | Empty transcripts are dropped by design; for language issues, confirm your provider auto-detects or set the model/language your provider expects. |
| Spend higher than expected | Cost guards unset or too loose | Set maxSizeBytes and maxPerHour; prefer a self-hosted provider for high volume. |

Next steps

  • Plugins — install, configure, and manage the voice-transcription plugin.
  • Sending messages — reply in WhatsApp once you have the transcribed text.
  • Webhooks — subscribe to core events such as message.received (distinct from the plugin's delivery webhook).