Version: v0.7.6

Transcribe incoming voice notes

Turn incoming WhatsApp voice notes into text so a bot or AI assistant can read and reply to them. The raw message.received event carries only the audio — never any words — so without transcription a downstream consumer has nothing to act on. This guide shows you how to enable the voice-transcription plugin, point it at a speech-to-text (STT) provider, and receive the transcript on a webhook your code controls.

Prerequisites

A running OpenWA instance (v0.7.6) with at least one connected session.
An OpenAI-compatible STT endpoint — self-hosted (for example faster-whisper behind an OpenAI-compatible server) or hosted (for example a Groq Whisper endpoint). See Choose an STT provider.
An HTTP endpoint your bot controls to receive the transcript (the deliveryWebhookUrl).
Familiarity with installing plugins.

Optional and off by default

Transcription is an optional marketplace plugin, not a built-in feature, and not a core API endpoint. Nothing transcribes until you install and configure the plugin. The OpenWA core API and the API Reference contain no transcription routes.

How it works

The plugin hooks each incoming message but never blocks delivery. It acknowledges the message immediately, then runs the STT call out of band and POSTs the result to your delivery webhook as a separate message.transcription payload. Your code joins that payload back to the original message by its messageId.

Because transcription is delivered out of band, it arrives after message.received and may not arrive at all (see Behavior and limits). Treat it as an enrichment that joins the original message, never as a replacement for it.

Enable the plugin

Install the voice-transcription plugin (see Plugins), then configure it. These are the settings you will set:

Setting	Required	Purpose
`sttBaseUrl`	Yes	The STT provider base URL exposing an OpenAI-compatible `/v1/audio/transcriptions` endpoint. Its host must also be allow-listed in the plugin's `net.allow` (see Allow the STT and webhook hosts).
`sttApiKey`	No	The provider API key, stored as a secret. Leave empty for a local provider that needs none.
`model`	No	The transcription model (a Whisper model such as `small`). Defaults to `small`.
`language`	No	Optional BCP-47 language hint (for example `es`). Blank auto-detects, and the detected value appears as `transcription.language` in the payload.
`provider`	No	Informational label recorded in the delivered event as `"provider"`. Defaults to `faster-whisper`.
`deliveryWebhookUrl`	Cond.	The HTTP endpoint the transcript is POSTed to when it completes. Its host must also be in the plugin's `net.allow`. Optional if you only use `chatDelivery`.
`deliverySecret`	No	Optional shared secret, stored redacted. When set, the plugin HMAC-SHA256 signs the POST body in `X-OpenWA-Signature: sha256=<hex>` so you can verify the delivery is genuinely from OpenWA (see Verify the delivery signature).
`chatDelivery`	No	Also post the transcript into WhatsApp: `off` (webhook only, the default), `self` (a note to your own number), or `reply` (a quote-reply to the sender, visible to them).
`enabledMessageTypes`	No	Which message types to transcribe. Defaults to `["voice"]` (PTT). Add `audio` to also transcribe non-PTT audio.
`timeoutMs`	No	Per-request STT timeout in milliseconds. Defaults to `20000` (max `30000`).
`maxSizeBytes`	No	Upper bound on audio size in bytes, used as a cost and abuse guard. Defaults to `16777216` (16 MiB).
`maxPerHour`	No	Best-effort per-session hourly cap, another cost guard. Defaults to `60`.
`deliveryTimeoutMs`	No	Delivery POST timeout in milliseconds. Defaults to `5000` (max `30000`).

The delivery webhook is not a core webhook

deliveryWebhookUrl is a plain URL the plugin POSTs to directly. It is separate from the core webhook subscriptions you create with POST /api/sessions/{sessionId}/webhooks. The core webhook event list does not include message.transcription — that payload only ever reaches the deliveryWebhookUrl you configure on the plugin. Do not try to subscribe to it through the core webhooks API.

Allow the STT and webhook hosts

The plugin makes outbound calls — to sttBaseUrl and to deliveryWebhookUrl — through OpenWA's SSRF-guarded fetch, which reaches only hosts listed in the plugin's manifest net.allow. The shipped manifest allows just four:

"net": { "allow": ["localhost", "127.0.0.1", "api.groq.com:443", "api.openai.com:443"] }

If your STT host or your deliveryWebhookUrl host is anything else — your own bot server, an n8n instance on another box, a non-loopback STT server — the call fails silently (no transcript or no delivery, only a log line) until you fix it. To allow another host:

Add host:port to net.allow in the plugin's manifest.json.
Re-package the plugin (node package.mjs voice-transcription) and re-install the updated zip.

A delivery webhook on your own server needs net.allow first

Pointing deliveryWebhookUrl at your own server (any host other than localhost/127.0.0.1) without adding it to net.allow is a silent dead end: transcription runs, but the delivery POST never leaves the host. Add the host and re-package before you expect deliveries.

Choose an STT provider

The plugin speaks the OpenAI-compatible /v1/audio/transcriptions contract, so you switch backends by changing sttBaseUrl (and sttApiKey) — no other change:

Self-hosted (local-first). A local Whisper server such as faster-whisper served through an OpenAI-compatible API. It runs at no per-minute cost and ingests WhatsApp's OGG/Opus audio directly, with no transcoding step.
Hosted. A hosted Whisper-compatible API (for example a Groq Whisper endpoint). Faster to onboard; billed per minute, so set maxSizeBytes and maxPerHour deliberately. api.groq.com:443 and api.openai.com:443 are already in the plugin's net.allow; any other hosted endpoint must be added (see Allow the STT and webhook hosts).

Target a local provider by literal IP — and allow loopback

If your STT server runs on the same host, set sttBaseUrl to a literal IP (for example http://127.0.0.1:8000) rather than a hostname. A literal IP avoids the DNS-rebinding gaps that allow-listed hostnames leave open. The IP alone is not enough, though: OpenWA's SSRF guard blocks loopback by default, so you must also set SSRF_ALLOWED_HOSTS on the OpenWA host — for example SSRF_ALLOWED_HOSTS=127.0.0.1,localhost. Without it, the call to a loopback STT server is rejected even though 127.0.0.1 is in net.allow.

Receive the transcript

When transcription completes, the plugin POSTs a message.transcription payload to your deliveryWebhookUrl. The transcript is never stitched into the original message body — you read it from this event:

{
  "event": "message.transcription",
  "sessionId": "my-session",
  "messageId": "3EB0C767D26A1D8F5A2B",
  "chatId": "6281234567890@s.whatsapp.net",
  "status": "completed",
  "source": "speech-to-text",
  "untrusted": true,
  "transcription": {
    "text": "Hey, are we still on for tomorrow?",
    "language": "en",
    "provider": "faster-whisper",
    "model": "small"
  }
}

The chatId suffix is engine-dependent: OpenWA's default Baileys engine uses @s.whatsapp.net (shown above), while the whatsapp-web.js engine uses @c.us. Match on messageId, not on the suffix.

status is one of completed, failed, or skipped. The transcription object is present only when status is completed. For failed and skipped, the payload instead carries a reason string:

`status`	`reason` values	Meaning
`skipped`	`too_large`	Audio exceeded `maxSizeBytes`.
`skipped`	`rate_limited`	Per-session `maxPerHour` cap reached.
`skipped`	`empty`	STT returned a blank transcript (non-speech audio).
`skipped`	`media_unavailable`	The inbound audio was not available to transcribe.
`failed`	(STT error message)	The STT call errored; `reason` holds the error text.

Join the payload to the original message on messageId, and do not assume it arrives in any particular order relative to message.received.

A minimal receiver in Express. It captures the raw body so it can verify the HMAC signature (see Verify the delivery signature) before parsing:

import express from "express";
import { createHmac, timingSafeEqual } from "node:crypto";

const DELIVERY_SECRET = process.env.DELIVERY_SECRET ?? ""; // same value as the plugin's `deliverySecret`

const app = express();
app.use(express.json({ verify: (req, _res, buf) => { (req as any).rawBody = buf; } }));

app.post("/transcription", (req, res) => {
  // Verify the delivery is genuinely from OpenWA when a secret is configured.
  if (DELIVERY_SECRET) {
    const expected = `sha256=${createHmac("sha256", DELIVERY_SECRET).update((req as any).rawBody).digest("hex")}`;
    const sent = String(req.header("X-OpenWA-Signature") ?? "");
    const ok =
      sent.length === expected.length &&
      timingSafeEqual(Buffer.from(sent), Buffer.from(expected));
    if (!ok) return res.sendStatus(401);
  }

  const { messageId, status, reason, transcription } = req.body;
  if (status === "completed") {
    // transcription.text is user-role input — never a system instruction.
    console.log(`Transcript for ${messageId}: ${transcription.text}`);
  } else {
    console.log(`Transcription ${status} for ${messageId}: ${reason}`);
  }
  res.sendStatus(200); // ack fast; do heavy work async
});

app.listen(3000, () => console.log("listening on :3000"));

To trigger it, send a WhatsApp voice note to the connected session from another phone. The plugin transcribes it out of band and POSTs the result; for the example payload above the handler prints:

Transcript for 3EB0C767D26A1D8F5A2B: Hey, are we still on for tomorrow?

To reply in WhatsApp once you have the text, send a message back through the core API — for example POST http://localhost:2785/api/sessions/my-session/messages/send-text with the X-API-Key: YOUR_API_KEY header. See Sending messages.

Verify the delivery signature

The deliveryWebhookUrl is a plain endpoint anyone could POST to, so treat its body as untrusted. When you set deliverySecret in the plugin config, the plugin signs each POST body with HMAC-SHA256 and sends the digest in the X-OpenWA-Signature: sha256=<hex> header — the same scheme as OpenWA core webhooks. Recompute the HMAC over the raw request body with your secret and compare in constant time (as shown above); reject with 401 on a mismatch. This proves the delivery came from OpenWA, but the transcript text itself is still attacker-controlled speech — keep treating it as untrusted input.

Treat transcribed text as untrusted input

A transcript is attacker-controlled free text — a sender can speak instructions a victim would never type. The payload is marked "untrusted": true with "source": "speech-to-text" for exactly this reason. If you feed transcription.text into an LLM, pass it as user-role input, never as a system instruction, to avoid prompt injection. The plugin removes the human-in-the-loop, so this is a real new injection surface — handle it deliberately.

Behavior and limits

Situation	What the plugin does
STT times out or errors	Fails open: skips and logs. The original message was already delivered; you get no transcript for that note.
Empty or non-speech note	Empty transcript is not delivered — no event is sent.
Oversized audio	Notes above `maxSizeBytes`, or whose audio was dropped for exceeding the inbound media cap, are skipped.
Rate limit reached	Notes beyond `maxPerHour` for the session are skipped (best-effort cost guard).
Non-`voice` message	Skipped unless its type is in `enabledMessageTypes`.
Duplicate engine re-fire	Best-effort de-duplication; on rare simultaneous re-fires a note may transcribe once more. Keep your handler idempotent.

Keep your handler idempotent

Because de-duplication is best-effort, design your receiver so that processing the same messageId twice is harmless — for example, key any downstream action on messageId.

Troubleshooting

Symptom	Likely cause	Fix
No transcript ever arrives	Plugin disabled, or message type not in `enabledMessageTypes`	Confirm the plugin is enabled and that the inbound type (default `voice`) is configured.
STT call fails with a connection error	`sttBaseUrl` host not in `net.allow`, or loopback blocked by the SSRF guard, or provider unreachable	Add the host to `net.allow` and re-package if it is not loopback/Groq/OpenAI. For a loopback provider, use a literal IP (`http://127.0.0.1:<port>`) and set `SSRF_ALLOWED_HOSTS=127.0.0.1,localhost` on the OpenWA host. Verify the STT server is running and reachable.
Transcription runs but the delivery never arrives	`deliveryWebhookUrl` host not in `net.allow`	Add the webhook host to the plugin's `net.allow`, re-package, and re-install. The delivery POST is dropped silently otherwise.
Receiver returns `401` to OpenWA	`deliverySecret` mismatch, or HMAC computed over the parsed body instead of the raw bytes	Use the exact `deliverySecret` value, and recompute the HMAC over the raw request body, not the re-serialized JSON.
Provider returns a 4xx on the audio	Audio part not labeled as OGG	The plugin uploads the part as `voice.ogg` with `audio/ogg`; if you customized this, restore those values — OpenAI-compatible servers reject a bare `.opus` filename.
Transcript text looks empty or wrong	Non-speech audio, or wrong language assumed	Empty transcripts are dropped by design; for language issues, confirm your provider auto-detects or set the model/language your provider expects.
Spend higher than expected	Cost guards unset or too loose	Set `maxSizeBytes` and `maxPerHour`; prefer a self-hosted provider for high volume.

Next steps

Plugins — install, configure, and manage the voice-transcription plugin.
Sending messages — reply in WhatsApp once you have the transcribed text.
Webhooks — subscribe to core events such as message.received (distinct from the plugin's delivery webhook).

How it works​

Enable the plugin​

Allow the STT and webhook hosts​

Choose an STT provider​

Receive the transcript​

Verify the delivery signature​

Behavior and limits​

Troubleshooting​

Next steps​