How Dictation Works

TalkWriter transforms your voice into polished, ready-to-use text through a multi-stage pipeline. Understanding each step helps you get better results and troubleshoot issues when they arise.

Why a Pipeline (Not Just Transcription)?

Basic dictation tools do one thing: convert audio to text. The result is a wall of unformatted words with no punctuation, no capitalization, and every "um" and "uh" included. You spend as long editing as you would have typing.

TalkWriter uses a four-stage pipeline because each stage solves a different problem. Splitting the work across specialized systems -- a speech engine optimized for accuracy, an AI model optimized for language cleanup -- produces results that no single system could match.

The Dictation Pipeline

When you dictate, your voice passes through four stages:

Voice Input --> Speech-to-Text --> AI Polish --> Paste
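To make the flow concrete, here is a minimal sketch of four stages chained in order. Every function is a hypothetical stand-in written for illustration: the real stages use a microphone, Soniox, Claude, and macOS Accessibility, none of which are called here.

```python
from typing import List


def capture_voice(spoken: str) -> List[str]:
    """Stand-in for Stage 1: split the 'audio' into small chunks (here, words)."""
    return spoken.split()


def speech_to_text(chunks: List[str]) -> str:
    """Stand-in for Stage 2: produce raw, unformatted lowercase text."""
    return " ".join(chunks).lower()


def ai_polish(raw: str) -> str:
    """Stand-in for Stage 3: capitalize the first letter and add a period."""
    text = raw.strip()
    if not text:
        return text
    return text[0].upper() + text[1:] + "."


def paste(text: str, document: List[str]) -> None:
    """Stand-in for Stage 4: insert the text into the target 'document'."""
    document.append(text)


def dictate(spoken: str, document: List[str]) -> str:
    """Run the four stages in order and return the pasted text."""
    chunks = capture_voice(spoken)
    raw = speech_to_text(chunks)
    polished = ai_polish(raw)
    paste(polished, document)
    return polished


doc: List[str] = []
result = dictate("Hello world", doc)
print(result)  # Hello world.
```

Each stage only consumes the output of the stage before it, which is why one stage can be swapped or skipped without touching the others.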

Stage 1: Voice Input

What happens: Your microphone captures your voice and TalkWriter streams the audio data to the cloud in real time.

Your Mac's built-in microphone, an external USB mic, or a Bluetooth headset captures audio.
TalkWriter streams audio as you speak. It does not wait until you finish -- this is what keeps latency low.
The pill overlay shows an animated waveform to confirm audio is being detected.
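Streaming as you speak can be pictured as cutting the audio into small fixed-size chunks and handing each one off as soon as it exists. The sketch below is illustrative only: the 16 kHz, 16-bit mono format and the 100 ms chunk size are assumptions, not TalkWriter's documented settings.

```python
from typing import Iterator

# Assumed format: 16 kHz, 16-bit mono, so 100 ms of audio is 3,200 bytes.
CHUNK_BYTES = 3200


def stream_chunks(audio: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield audio in fixed-size chunks so each can be sent as soon as it is captured."""
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


# One second of silent placeholder audio: 16,000 samples x 2 bytes each.
one_second = bytes(32000)
sent_sizes = [len(chunk) for chunk in stream_chunks(one_second)]
print(len(sent_sizes))  # 10
```

Because the generator yields each chunk immediately, a consumer can start working on the first 100 ms while the rest of the sentence is still being spoken.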

Tip

For the best results, speak clearly and keep your microphone 6-12 inches from your mouth. An external USB microphone or headset makes a bigger difference than any software setting.

Stage 2: Speech-to-Text (Soniox STT)

What happens: A professional-grade speech recognition engine (Soniox) converts your audio stream into raw text.

Soniox processes your audio in real time with low latency (~200ms).
It supports 100+ languages and handles accents, fast speech, and technical vocabulary.
The raw output is unformatted: no punctuation, no capitalization, and filler words are included.

Example raw output:

hey um i wanted to follow up on our meeting from yesterday i think the project timeline looks good but uh we might need to push the design review back a week

Stage 3: AI Polish

What happens: TalkWriter's AI engine (powered by Claude) cleans up the raw transcription and produces natural, well-formatted text.

AI Polish performs these transformations:

Transformation	Before	After
Remove filler words	"um", "uh", "like", "you know"	Removed
Add punctuation	"hello how are you"	"Hello, how are you?"
Fix capitalization	"i went to new york"	"I went to New York"
Format numbers	"twenty five dollars"	"$25"
Clean sentence structure	"so basically the thing is that"	Direct phrasing

Example polished output:

Hey, I wanted to follow up on our meeting from yesterday. I think the project timeline looks good, but we might need to push the design review back a week.
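For intuition about what this cleanup involves, the sketch below approximates a few of the transformations with plain rules. It is not how AI Polish works: the product uses a language model, and this regex stand-in cannot add commas, split sentences, or format numbers. The filler list and function name are illustrative assumptions.

```python
import re

# Deliberately short list: words such as "like" are often real content,
# which is one reason a language model handles this better than rules.
FILLERS = ("um", "uh")

_FILLER_RE = re.compile(r"\b(?:" + "|".join(FILLERS) + r")\b", re.IGNORECASE)


def rough_polish(raw: str) -> str:
    """Rule-based approximation: drop fillers, fix 'i', capitalize, end with a period."""
    text = _FILLER_RE.sub("", raw)
    text = re.sub(r"\s+", " ", text).strip()  # collapse gaps left by removed fillers
    text = re.sub(r"\bi\b", "I", text)        # standalone pronoun "i" becomes "I"
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".?!":
        text += "."
    return text


polished = rough_polish(
    "hey um i think the timeline looks good but uh we might need more time"
)
print(polished)  # Hey I think the timeline looks good but we might need more time.
```

The missing commas in this output show the gap between rule-based cleanup and the polished example above.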

Pro Feature

TalkTone adds an extra layer after AI Polish. If you have Pro, your text is rewritten to match a selected writing style (Professional, Casual, Academic, etc.) with your chosen formatting and intensity. Learn about TalkTone

Stage 4: Paste

What happens: The polished text is inserted at your cursor position in whatever app is active.

TalkWriter uses macOS Accessibility to simulate a clipboard paste action.
Text appears wherever your cursor was when you started dictating.
The pill overlay briefly shows a checkmark to confirm the paste succeeded.

Pipeline Summary

Stage	Engine	Where It Runs	Speed
Voice Input	Your microphone	Locally on your Mac	Instant
Speech-to-Text	Soniox (cloud)	Cloud servers	~200ms latency
AI Polish	Claude AI (cloud)	Cloud servers	~500ms-1s
Paste	macOS Accessibility	Locally on your Mac	Instant

Note

Total time from releasing the Fn key to seeing text: typically under 2 seconds for short dictations. Longer passages may take slightly more time for AI processing.
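As a quick check, the per-stage figures in the summary table can be added up. The sketch below treats the "Instant" stages as 0 ms, which is a simplification, and ignores anything the table does not itemize, such as network round trips.

```python
# Per-stage latency ranges in milliseconds (low, high), taken from the summary table.
# "Instant" is modeled as 0 ms, which is an approximation.
STAGE_LATENCY_MS = {
    "Voice Input": (0, 0),
    "Speech-to-Text": (200, 200),
    "AI Polish": (500, 1000),
    "Paste": (0, 0),
}

best_case_ms = sum(low for low, _ in STAGE_LATENCY_MS.values())
worst_case_ms = sum(high for _, high in STAGE_LATENCY_MS.values())

print(best_case_ms, worst_case_ms)  # 700 1200
```

The itemized stages total roughly 0.7 to 1.2 seconds, which leaves headroom under the 2-second figure for overhead the table does not list.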

Practical Example: Pipeline in Real Time

Scenario: You are in Google Docs writing a project update.

You hold Fn and say: "so the backend migration is about seventy percent done and we should be finished by end of next week assuming no blockers come up"
Stage 1 (Voice Input): Your mic streams audio to the cloud. The pill shows a waveform.
Stage 2 (Soniox): Raw text is generated: "so the backend migration is about seventy percent done and we should be finished by end of next week assuming no blockers come up"
Stage 3 (AI Polish): The AI cleans it up: "The backend migration is about 70% complete. We should be finished by end of next week, assuming no blockers come up."
Stage 4 (Paste): The polished text appears in Google Docs at your cursor. The pill shows a checkmark.

Total elapsed time: ~1.5 seconds.

Pro Tip

You can skip AI Polish entirely by toggling it off in Settings > AI Polish. This gives you raw Soniox transcription with no cleanup -- useful when you want to see exactly what the speech engine heard, or when you are dictating in a language where AI Polish adds less value.
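Conceptually, the toggle decides whether the raw transcript passes through the polish step or goes straight to paste. The sketch below illustrates that branch. The Settings class, the ai_polish_enabled field, and the final_text function are hypothetical names, not TalkWriter's actual configuration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Settings:
    # Hypothetical field mirroring the Settings > AI Polish toggle.
    ai_polish_enabled: bool = True


def final_text(raw: str, settings: Settings, polish: Callable[[str], str]) -> str:
    """Return polished text when the toggle is on, otherwise the raw transcript."""
    return polish(raw) if settings.ai_polish_enabled else raw


raw_transcript = "hello how are you"
with_polish = final_text(raw_transcript, Settings(), str.capitalize)
without_polish = final_text(raw_transcript, Settings(ai_polish_enabled=False), str.capitalize)
print(with_polish)     # Hello how are you
print(without_polish)  # hello how are you
```

Here str.capitalize stands in for the polish step purely to make the difference between the two outputs visible.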

Frequently Asked Questions

Can I skip AI Polish and get raw transcription? Yes. Toggle AI Polish off in Settings > AI Polish. You get the unformatted Soniox output directly.

Is my audio stored on the server? Audio is streamed for real-time processing and is not permanently stored. See our privacy policy for details.

Why does TalkWriter need the internet? Both the speech-to-text engine (Soniox) and AI Polish (Claude) run in the cloud. Cloud models are significantly more accurate than on-device alternatives, which is why TalkWriter requires an internet connection for all dictation.

What happens if my internet drops mid-dictation? TalkWriter shows an error on the pill overlay. Any audio captured before the disconnection may still be processed, but results are not guaranteed.

Was this helpful? Let us know at support@talkwriter.ai
