How Dictation Works
TalkWriter transforms your voice into polished, ready-to-use text through a multi-stage pipeline. Understanding each step helps you get better results and troubleshoot issues when they arise.
Why a Pipeline (Not Just Transcription)?
Basic dictation tools do one thing: convert audio to text. The result is a wall of unformatted words with no punctuation, no capitalization, and every "um" and "uh" included. You spend as long editing as you would have typing.
TalkWriter uses a four-stage pipeline because each stage solves a different problem. Splitting the work across specialized systems -- a speech engine optimized for accuracy, an AI model optimized for language cleanup -- produces cleaner text than a transcription engine working alone.
The Dictation Pipeline
When you dictate, your voice passes through four stages:
Voice Input --> Speech-to-Text --> AI Polish --> Paste
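To make the flow concrete, here is a minimal sketch of the four stages chained as ordinary functions. Every function name and the canned data are hypothetical stand-ins for illustration; TalkWriter's actual code is not shown in this article, and the real stages call a microphone, cloud services, and macOS APIs.

```python
from typing import Iterable, List

# Hypothetical stand-ins for the four stages (not TalkWriter's real API).

def capture_audio() -> List[bytes]:
    """Stage 1: pretend the microphone produced three small audio chunks."""
    return [b"chunk-1", b"chunk-2", b"chunk-3"]

def speech_to_text(chunks: Iterable[bytes]) -> str:
    """Stage 2: a real engine decodes audio; this stub returns canned raw text."""
    _ = list(chunks)  # consume the audio
    return "the timeline looks good"

def ai_polish(raw: str) -> str:
    """Stage 3: placeholder cleanup (capitalize first letter, add a period)."""
    return raw[0].upper() + raw[1:] + "."

def paste(text: str, document: List[str]) -> None:
    """Stage 4: 'paste' by appending to an in-memory document."""
    document.append(text)

def dictate(document: List[str]) -> str:
    """Run all four stages in order and return the pasted text."""
    chunks = capture_audio()
    raw = speech_to_text(chunks)
    polished = ai_polish(raw)
    paste(polished, document)
    return polished

document: List[str] = []
print(dictate(document))  # The timeline looks good.
```

Each stage only needs the output of the one before it, which is why a single stage can be replaced or skipped without touching the others.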
Stage 1: Voice Input
What happens: Your microphone captures your voice and TalkWriter streams the audio data to the cloud in real time.
- Your Mac's built-in microphone, an external USB mic, or a Bluetooth headset captures audio.
- TalkWriter streams audio as you speak. It does not wait until you finish -- this is what keeps latency low.
- The pill overlay shows an animated waveform to confirm audio is being detected.
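The difference between streaming and "record, then upload" can be sketched with a generator that hands over audio in small chunks as it arrives. The chunk size and function names below are assumptions made for the example, and the "send" step only collects chunks in a list where a real client would write to a network connection.

```python
from typing import Iterator, List

CHUNK_BYTES = 4  # hypothetical chunk size, kept tiny so the output is easy to read

def microphone(recording: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield audio in fixed-size chunks as it 'arrives' instead of all at once."""
    for start in range(0, len(recording), chunk_bytes):
        yield recording[start:start + chunk_bytes]

def stream_to_cloud(chunks: Iterator[bytes]) -> List[bytes]:
    """Forward each chunk as soon as it is available; return what was sent, in order."""
    sent = []
    for chunk in chunks:
        sent.append(chunk)  # a real client would send over the network here
    return sent

sent = stream_to_cloud(microphone(b"0123456789"))
print(sent)  # [b'0123', b'4567', b'89']
```

Because the first chunk can be processed while later ones are still being spoken, most of the work is already done by the time you stop talking.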
For the best results, speak clearly and keep your microphone 6-12 inches from your mouth. An external USB microphone or headset makes a bigger difference than any software setting.
Stage 2: Speech-to-Text (Soniox STT)
What happens: A professional-grade speech recognition engine (Soniox) converts your audio stream into raw text.
- Soniox processes your audio in real time with low latency (~200ms).
- It supports 100+ languages and handles accents, fast speech, and technical vocabulary.
- The raw output is unformatted: no punctuation, no capitalization, and filler words are included.
Example raw output:
hey um i wanted to follow up on our meeting from yesterday i think the project timeline looks good but uh we might need to push the design review back a week
Stage 3: AI Polish
What happens: TalkWriter's AI engine (powered by Claude) cleans up the raw transcription and produces natural, well-formatted text.
AI Polish performs these transformations:
| Transformation | Before | After |
|---|---|---|
| Remove filler words | "um", "uh", "like", "you know" | Removed |
| Add punctuation | "hello how are you" | "Hello, how are you?" |
| Fix capitalization | "i went to new york" | "I went to New York" |
| Format numbers | "twenty five dollars" | "$25" |
| Clean sentence structure | "so basically the thing is that" | Direct phrasing |
Example polished output:
Hey, I wanted to follow up on our meeting from yesterday. I think the project timeline looks good, but we might need to push the design review back a week.
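For intuition only, a few of the simpler rows in the table can be imitated with plain string rules. This is a toy, not how AI Polish works: the real stage uses a language model, which is what lets it handle punctuation inside sentences, number formatting, and rephrasing. The function name and the filler list are assumptions for the example.

```python
FILLERS = ("um", "uh")  # illustrative subset of the filler words in the table

def toy_polish(raw: str) -> str:
    """Rule-based imitation of three cleanups: fillers, 'i' -> 'I', final period."""
    # Drop standalone filler words.
    words = [w for w in raw.split() if w.lower() not in FILLERS]
    # Capitalize the pronoun "i".
    words = ["I" if w == "i" else w for w in words]
    text = " ".join(words)
    if not text:
        return text
    # Capitalize the first letter and make sure the sentence ends with punctuation.
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text

print(toy_polish("um i think the timeline looks good uh"))
# I think the timeline looks good.
```

Rules like these break quickly (they cannot tell where one sentence ends and the next begins, for example), which is the gap the AI stage is meant to fill.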
TalkTone adds an extra layer after AI Polish. On the Pro plan, your text is rewritten to match a selected writing style (Professional, Casual, Academic, etc.) with your chosen formatting and intensity. See the TalkTone documentation to learn more.
Stage 4: Paste
What happens: The polished text is inserted at your cursor position in whatever app is active.
- TalkWriter uses macOS Accessibility to simulate a clipboard paste action.
- Text appears wherever your cursor was when you started dictating.
- The pill overlay briefly shows a checkmark to confirm the paste succeeded.
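What "inserted at your cursor position" means can be shown with a plain string model of a document. This sketch does not touch the macOS clipboard or Accessibility APIs; the function name and the sample text are made up for illustration.

```python
from typing import Tuple

def paste_at_cursor(document: str, cursor: int, text: str) -> Tuple[str, int]:
    """Insert text at the cursor index and return (new_document, new_cursor)."""
    new_document = document[:cursor] + text + document[cursor:]
    return new_document, cursor + len(text)

doc, cursor = paste_at_cursor("Status: . Thanks!", 8, "on track")
print(doc)     # Status: on track. Thanks!
print(cursor)  # 16
```

Text on either side of the cursor is left untouched, and the cursor ends up just after the inserted text, as it would after a normal paste.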
Pipeline Summary
| Stage | Engine | Where It Runs | Typical Latency |
|---|---|---|---|
| Voice Input | Your microphone | Locally on your Mac | Instant |
| Speech-to-Text | Soniox | Cloud servers | ~200ms |
| AI Polish | Claude | Cloud servers | ~500ms-1s |
| Paste | macOS Accessibility | Locally on your Mac | Instant |
Total time from releasing the Fn key to seeing text: typically under 2 seconds for short dictations. Longer passages may take slightly more time for AI processing.
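As a rough check on that figure, the per-stage numbers from the summary table can be added up. Treating the "Instant" local stages as 0 ms is an assumption made for this sketch.

```python
# Per-stage latency ranges in milliseconds, taken from the summary table.
# "Instant" local stages are approximated as 0 ms (an assumption).
STAGE_LATENCY_MS = {
    "Voice Input": (0, 0),
    "Speech-to-Text": (200, 200),
    "AI Polish": (500, 1000),
    "Paste": (0, 0),
}

def total_latency_ms(stages):
    """Sum the lower and upper bounds across all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high

low, high = total_latency_ms(STAGE_LATENCY_MS)
print(f"{low}-{high} ms")  # 700-1200 ms
```

The listed stages account for roughly 0.7 to 1.2 seconds. The remaining headroom under 2 seconds presumably covers network round trips and other overhead that the table does not itemize.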
Practical Example: Pipeline in Real Time
Scenario: You are in Google Docs writing a project update.
- You hold Fn and say: "so the backend migration is about seventy percent done and we should be finished by end of next week assuming no blockers come up"
- Stage 1 (Voice Input): Your mic captures the audio and TalkWriter streams it to the cloud. The pill shows a waveform.
- Stage 2 (Soniox): Raw text is generated: "so the backend migration is about seventy percent done and we should be finished by end of next week assuming no blockers come up"
- Stage 3 (AI Polish): The AI cleans it up: "The backend migration is about 70% complete. We should be finished by end of next week, assuming no blockers come up."
- Stage 4 (Paste): The polished text appears in Google Docs at your cursor. The pill shows a checkmark.
Total elapsed time: ~1.5 seconds.
You can skip AI Polish entirely by toggling it off in Settings > AI Polish. This gives you raw Soniox transcription with no cleanup -- useful when you want to see exactly what the speech engine heard, or when you are dictating in a language where AI Polish adds less value.
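Conceptually, the toggle decides whether the transcript takes a detour through the cleanup stage before it is pasted. A minimal sketch, with a hypothetical `finalize` helper and a stand-in polish function:

```python
from typing import Callable

def finalize(raw: str, polish_enabled: bool, polish: Callable[[str], str]) -> str:
    """Return polished text when the toggle is on, otherwise the raw transcript."""
    return polish(raw) if polish_enabled else raw

def demo_polish(raw: str) -> str:
    """Stand-in for AI Polish: capitalize the first letter and add a period."""
    return raw.capitalize() + "."

print(finalize("hello world", True, demo_polish))   # Hello world.
print(finalize("hello world", False, demo_polish))  # hello world
```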
Frequently Asked Questions
Can I skip AI Polish and get raw transcription? Yes. Toggle AI Polish off in Settings > AI Polish. You get the unformatted Soniox output directly.
Is my audio stored on the server? Audio is streamed for real-time processing and is not permanently stored. See our privacy policy for details.
Why does TalkWriter need the internet? Both the speech-to-text engine (Soniox) and AI Polish (Claude) run in the cloud. Cloud models are significantly more accurate than on-device alternatives, which is why TalkWriter requires an internet connection for all dictation.
What happens if my internet drops mid-dictation? TalkWriter shows an error on the pill overlay. Any audio captured before the disconnection may still be processed, but results are not guaranteed.
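The behavior described in the last answer can be pictured as follows: chunks sent before the drop are kept, the failure is reported, and nothing after the drop reaches the server. This is a simulation under assumed names (`send_until_drop`, `drop_after`), not TalkWriter's actual error handling.

```python
from typing import Iterable, List, Tuple

def send_until_drop(chunks: Iterable[bytes], drop_after: int) -> Tuple[List[bytes], bool]:
    """Send chunks until a simulated disconnect; return (sent_chunks, dropped)."""
    sent: List[bytes] = []
    try:
        for index, chunk in enumerate(chunks):
            if index >= drop_after:
                # Simulate the network failing before this chunk is sent.
                raise ConnectionError("network dropped")
            sent.append(chunk)
    except ConnectionError:
        return sent, True  # partial audio only; the caller should show an error
    return sent, False

print(send_until_drop([b"a", b"b", b"c"], drop_after=2))  # ([b'a', b'b'], True)
print(send_until_drop([b"a", b"b", b"c"], drop_after=5))  # ([b'a', b'b', b'c'], False)
```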
Was this helpful? Let us know at support@talkwriter.ai