A transcription service converts spoken audio or video into written text. Professional services handle the full workflow: audio processing, speech-to-text conversion, accuracy review, formatting, and secure delivery.
Unlike consumer speech-to-text tools, professional transcription services achieve 98–99% accuracy, identify multiple speakers, handle specialist terminology, and deliver structured transcripts suitable for legal, medical, research, or business documentation.
Many organisations begin with internal transcription using tools such as Zoom auto-captions, Otter, or Whisper. These systems can work for quick internal notes, but typical accuracy levels of 80–90% quickly become problematic at scale.
Multi-speaker meetings, strong accents, background noise, and specialist terminology significantly increase error rates. Correcting low-quality transcripts often takes more time than commissioning professional transcription from the start.
This guide explains what a transcription service includes, the types of transcription available, how AI and human transcription differ, pricing models used across the industry, quality and accuracy standards, and when organisations should use professional transcription rather than automated tools.
Transcription Service Definition: What It Is, What It Converts, and What It Delivers
A transcription service converts spoken audio or video recordings into an accurate, structured written document. Professional transcription goes beyond simple speech-to-text conversion by adding speaker identification, timestamps, formatting, and quality checks.
These elements make the transcript reliable for business, legal, research, or medical use and distinguish professional services from consumer auto-captioning tools. As Robert C. Moore (2003) explains, “accurate transcription often requires human review and structured annotation beyond automatic speech recognition output.”
Transcription vs Translation vs Captioning: Three Related but Distinct Services Clarified

Three related services are often confused in procurement decisions: transcription, translation, and captioning.
Transcription converts spoken audio or video into written text in the same language. For example, an English meeting recording becomes an English transcript. Organisations use transcription for documentation, compliance records, research analysis, and searchable archives.
Translation converts written or spoken content from one language into another. For example, an English document is translated into French. Translation supports multilingual communication and international operations.
Captioning displays synchronised text on video. Instead of a plain transcript, captioning produces a timed caption file (such as SRT or VTT) used for accessibility, streaming platforms, and eLearning video.
In some workflows, the services overlap. Audio translation combines transcription and translation in one process. Subtitles add timing to translated transcripts, while SDH subtitles include accessibility cues such as speaker identification and sound effects.
For B2B teams: if you need written text in the same language, commission transcription. If you need another language, use audio translation. If the text must appear on video, request captioning or subtitles.
What a Professional Transcription Service Delivers: The Full Scope Beyond “Converting Audio to Text”
A professional transcript includes several quality and formatting elements that consumer speech-to-text tools usually do not provide.
First, the spoken content is captured with high accuracy, typically 98–99%+ for human-reviewed transcription. Specialist terminology, proper nouns, and technical vocabulary are verified rather than guessed.
Second, speaker identification (diarisation) labels each participant, which is essential for interviews, focus groups, depositions, and panel discussions.
Third, timestamps are inserted at speaker changes or intervals, allowing readers to quickly locate moments in the original recording.
Professional transcripts may also include non-verbal cues such as [laughter], [pause], or [inaudible at 00:04:32]. Formatting is applied according to client specifications, including paragraph structure and style conventions.
Before delivery, a proofreading or QA pass checks the transcript against the audio to ensure accuracy and consistency. Files are then delivered through secure channels, typically protected by NDA and encrypted transfer.
Auto-transcription tools generally lack these steps. They rarely guarantee terminology accuracy, consistent speaker labelling, structured formatting, or independent quality review.
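As a rough illustration of the elements described above (speaker labels, timestamps, and non-verbal cues), the following Python sketch renders a few transcript segments as plain text. The `Segment` class, its field names, and the `[HH:MM:SS] Speaker: text` layout are assumptions made for this example, not a format every provider follows.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One speaker turn in a transcript (hypothetical structure)."""
    start_seconds: int
    speaker: str
    text: str


def format_timestamp(total_seconds: int) -> str:
    """Convert a count of seconds into HH:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def render_transcript(segments: list[Segment]) -> str:
    """Render each segment as '[HH:MM:SS] Speaker: text' on its own line."""
    return "\n".join(
        f"[{format_timestamp(s.start_seconds)}] {s.speaker}: {s.text}"
        for s in segments
    )


segments = [
    Segment(0, "Interviewer", "Thanks for joining us today."),
    Segment(272, "Participant", "The launch went well [laughter]."),
]
print(render_transcript(segments))
```

In practice the layout would follow the client's style guide, for example timestamps at fixed intervals rather than at every speaker change.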
The Transcription Process: Step-by-Step from Audio File Submission to Final Transcript Delivery

Professional transcription follows a structured workflow that combines automated tools and human review to achieve high accuracy.
Step 1 — File submission
The client uploads audio or video files through a secure portal or sends a link. Common formats include MP3, WAV, MP4, M4A, MOV, FLAC, and AIFF. Clients may also provide speaker names, terminology lists, or style instructions.
Step 2 — Automated speech recognition (ASR)
Speech-to-text software generates a draft transcript. Modern ASR systems typically achieve 80–92% accuracy on clear audio and provide initial timestamps and speaker segmentation.
Step 3 — Human transcriptionist review
A professional transcriptionist listens to the recording and corrects the draft. They resolve terminology errors, punctuation issues, speaker labels, and unclear audio sections.
Step 4 — Editing and quality assurance
A second reviewer verifies the transcript against the audio, ensuring accuracy, formatting consistency, and correct timestamps.
Step 5 — Delivery and archiving
The final transcript is delivered in formats such as DOCX, PDF, TXT, or SRT, depending on the client’s needs. Many providers store files securely for future reference.
To improve transcription accuracy, organisations should record in quiet environments, use quality microphones, provide lists of technical terms, introduce speakers clearly, and minimise overlapping speech.
Types of Transcription Services: Verbatim, Intelligent Verbatim, Edited, and Domain-Specific
Transcription services vary depending on the purpose of the transcript. The correct style depends on how the text will be used — whether as a legal record, research dataset, internal documentation, or publication-ready content.
| Type | What it captures | What it omits | Best for |
| Verbatim | Every word spoken, including filler words, repetitions, and non-verbal sounds such as [um], [uh], [laughter] | Nothing — full record of speech | Legal proceedings, depositions, compliance records, linguistic research |
| Intelligent verbatim (clean verbatim) | Meaningful speech with grammar lightly cleaned | Filler words, false starts, repeated phrases, most non-verbal sounds | Meetings, interviews, podcasts, focus groups, most business use cases |
| Edited transcription | Polished, structured text rewritten for clarity and readability | Informal speech patterns, filler content, repetitions | Articles, reports, presentations, marketing content |
| Phonetic transcription | Pronunciation rendered in phonetic symbols (IPA) | Semantic meaning | Linguistics research, dialect studies, language learning |
| Medical transcription | Clinical dictation, patient records, procedure notes | Non-medical content | Healthcare documentation workflows |
| Legal transcription | Depositions, hearings, interviews, legal recordings | Non-legal content | Court preparation, discovery, legal records |
Verbatim vs Intelligent Verbatim: Which Transcription Style Does Your Use Case Require?

Choosing between verbatim and intelligent verbatim is one of the most important decisions when commissioning transcription.
Verbatim transcription records every spoken element, including filler words, pauses, and repeated phrases. For example:
“Um… so I—I think the contract was, uh, signed in March… or maybe April.”
This style is required for legal proceedings, compliance documentation, and research analysing language patterns, where hesitation, pauses, and wording choices may carry meaning.
Intelligent verbatim removes filler language and minor repetitions while preserving the speaker’s intended message. The same statement becomes:
“I think the contract was signed in March or April.”
This cleaner format is usually preferred for business meetings, interviews, podcasts, training materials, and research summaries, where readability matters more than capturing every spoken hesitation.
If the style is not specified, most transcription providers default to intelligent verbatim, so it is best to confirm your preference before submitting audio.
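The mechanical part of the difference between the two styles can be sketched in a few lines of Python. This is a deliberately naive illustration: the filler list (`um`, `uh`) and the repeated-word rule are assumptions chosen for the example, and it cannot make the editorial judgements a transcriptionist makes, such as condensing “in March… or maybe April” to “in March or April”.

```python
import re

# Hesitation sounds, with an optional leading comma and trailing comma or ellipsis.
FILLERS = re.compile(r",?\s*\b(?:um+|uh+)\b[,\u2026]*", re.IGNORECASE)
# A word immediately repeated, such as "I I" or "the the".
REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_verbatim(text: str) -> str:
    """Remove simple fillers and stutters from a verbatim sentence."""
    text = FILLERS.sub("", text)
    text = REPEATS.sub(r"\1", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Restore the capital letter if a leading filler was removed.
    return text[:1].upper() + text[1:]


print(clean_verbatim("Um, I I think the contract was, uh, signed in March."))
```

A rule like this would also collapse legitimate repetitions (for example “had had”), which is one reason intelligent verbatim is normally produced by a human editor rather than by pattern matching.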
Domain-Specific Transcription: Medical, Legal, Technical, and Media Transcription Requirements
Many transcription projects require domain expertise to ensure terminology and context are handled correctly.
Medical transcription converts physician dictation into structured patient documentation such as operative reports, discharge summaries, and clinical notes. Because patient safety depends on precise terminology, the industry standard is 98.5%+ accuracy, often reviewed by trained medical language specialists.
Legal transcription covers depositions, court hearings, interrogations, and client interviews. Output is typically verbatim, and formatting must follow legal documentation standards so transcripts can be used in litigation or discovery.
Technical transcription involves specialised vocabulary used in engineering, software development, or scientific research. Without domain familiarity, automated tools frequently misinterpret technical terms such as software protocols or scientific processes.
Media transcription supports journalism, podcasts, and documentary production. The key requirement is usually speed, with transcripts delivered within 24–48 hours to support production or publishing schedules.
Audio Transcription vs Video Transcription vs Live/Real-Time Transcription: Format Differences and Output Types
Transcription services also vary based on the source format and delivery requirements.
Audio transcription converts recordings such as interviews, meetings, podcasts, and phone calls into written transcripts. The output is typically a text document (DOCX, PDF, or TXT) with optional timestamps and speaker labels.
Video transcription works from video files or video links. Clients may request either a plain transcript or a caption file (SRT or VTT) with synchronised timestamps for video playback.
Live or real-time transcription captures speech during live events, meetings, or court sessions. Two approaches are common:
- CART (Communication Access Realtime Translation), where a stenographer produces near-instant text with about 98% accuracy.
- AI live captioning, used in tools such as Zoom or Teams, which typically provides 85–92% accuracy and often requires post-event correction for accessibility compliance.
For organisations choosing a service, the key decision is whether the output should be a simple transcript, a timed caption file, or live captions during an event.
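To make the difference between a plain transcript and a timed caption file concrete, the sketch below converts a list of timed cues into SRT text. SRT blocks consist of a sequence number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, and the caption text, separated by blank lines. The function names and the `(start, end, text)` tuple layout are assumptions for this example.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS,mmm (SRT style)."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, caption) tuples."""
    blocks = []
    for index, (start, end, caption) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{caption}\n"
        )
    # Blocks are separated by a blank line.
    return "\n".join(blocks)


cues = [
    (0.0, 2.5, "Welcome to the session."),
    (2.5, 5.0, "Let's begin."),
]
print(to_srt(cues))
```

VTT files follow a similar structure but start with a `WEBVTT` header line and use a full stop instead of a comma before the milliseconds.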
AI Transcription vs Human Transcription: Accuracy Benchmarks, Cost Trade-offs, and B2B Decision Guide
Choosing between AI and human transcription depends on the audio quality, accuracy requirement, and intended use of the transcript. AI tools provide speed and low cost, while human transcription delivers higher accuracy and reliability. Most B2B workflows combine both approaches to balance efficiency with professional-quality output.
| Dimension | AI Transcription | Human Transcription |
| Accuracy (clean audio) | 85–95% | 98–99%+ |
| Accuracy (poor audio: accents, noise, crosstalk) | 60–80% | 95–98% |
| Specialist terminology handling | Limited without domain models | Strong with experienced transcriptionists |
| Speaker diarisation | Automated but imperfect | Verified and accurate |
| Turnaround time | Near-instant (<5 min per hour of audio) | Typically 4–48 hours per hour of audio (4–8 hours with ASR-assisted review; 24–48 hours fully manual) |
| Cost | $0.05–$0.25 per minute | $0.75–$3.00 per minute |
| Data privacy | Depends on the cloud provider | NDA, encryption, secure workflows |
| Style and formatting compliance | Minimal | Custom formatting and style guides |
| Legal/compliance suitability | Not reliable without review | Suitable for regulated content |
| Publication readiness | Requires editing | Ready for review |
How AI Transcription Tools Work: ASR, Word Error Rate, and Why Clean Audio Is Not Enough
AI transcription tools such as Whisper, AWS Transcribe, Google Speech-to-Text, AssemblyAI, Otter.ai, and Descript rely on automatic speech recognition (ASR) models that convert audio signals into text.
The process involves four main stages: audio preprocessing, acoustic modelling to detect phonemes, language modelling to predict word sequences, and decoding to generate the final transcript.
Accuracy is measured using Word Error Rate (WER):
WER = (Substitutions + Deletions + Insertions) ÷ Total Words × 100
A 10% WER means roughly 10 errors per 100 words. In a one-hour interview containing about 9,000 words, that equates to approximately 900 errors requiring manual correction.
Real-world audio often lowers accuracy because ASR systems perform best with clean, single-speaker recordings. Meeting recordings, phone calls, strong accents, and specialist terminology commonly reduce accuracy to 70–85%.
AI-only transcription is usually sufficient only when the audio is clean, single-speaker, and used for low-stakes internal notes.
Human Transcription: What Professional Transcriptionists Do That AI Cannot Replicate
Human transcriptionists add capabilities that AI tools still struggle to replicate.
First, they use contextual understanding to correct errors. For example, medical or legal terms misinterpreted by ASR can be recognised and corrected by a specialist transcriptionist.
Second, humans handle accents, dialects, and code-switching more effectively than automated models. Strong regional accents or multilingual conversations often cause significant ASR errors that humans can resolve.
Third, humans interpret overlapping speech and conversational context, assigning the correct dialogue to each speaker even during interruptions.
Finally, professional transcriptionists apply formatting standards and style guidelines, ensuring transcripts follow specific templates or documentation rules.
Human transcription is recommended when transcripts will be published, submitted as evidence, used in regulated industries, or contain specialist terminology.
AI + Human Hybrid Transcription: The MTPE Model Applied to Audio and Why It Is the B2B Standard
Most professional transcription services now use a hybrid workflow combining AI and human review. This model mirrors the machine-translation post-editing (MTPE) approach used in translation workflows.
The process begins with an AI-generated draft transcript created through ASR. A professional transcriptionist then reviews the draft while listening to the audio, correcting errors, verifying speaker labels, and adding formatting. Finally, a second reviewer performs a quality assurance pass to ensure accuracy.
This approach improves both speed and cost efficiency. Instead of typing from scratch, the transcriptionist edits an existing draft, increasing productivity significantly.
Hybrid workflows typically achieve 98–99% accuracy while costing 30–50% less than fully manual transcription, making them the standard model for most B2B transcription services.
Transcription Service Pricing: Per-Minute, Per-Word, and Per-Page Models Explained
Transcription pricing varies depending on how the service is billed — per minute of audio, per word of transcript, or per page of output. Pricing also changes based on quality tier (AI-only, hybrid, or fully human), audio complexity, number of speakers, and turnaround time requirements.
| Model | How it’s calculated | Typical range | Best for |
| Per audio minute | Rate × number of minutes of audio | AI: $0.05–$0.25/min; Hybrid: $0.50–$1.50/min; Human: $1.00–$3.00/min | Standard transcription; audio-heavy workflows |
| Per output word | Rate × number of words in the transcript | $0.006–$0.02/word | Long-form content; research transcription |
| Per page (transcript page) | Rate × number of pages of transcript | $0.50–$3.00/page | Legal and medical transcription with standard page formats |
| Hourly subscription | Monthly fee for defined audio hours | $20–$200/month | High-volume recurring transcription |
| Rush/priority surcharge | % added to base rate | +25–100% | Urgent delivery (2–4 hour TAT) |
Per-Minute vs Per-Word Pricing: How to Calculate True Transcription Cost for Your Audio Volume
Per-minute and per-word pricing are not directly comparable without converting audio duration to expected transcript length.
Conversational English averages 130–150 words per minute, so one hour of audio produces roughly 7,800–9,000 words. For example, hybrid transcription at $1.25 per audio minute costs $75 for a 60-minute recording, while per-word pricing at $0.012/word costs $100.80 for an 8,400-word transcript (60 minutes at 140 words per minute).
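The comparison above can be reproduced with a short calculation. This Python sketch uses the rates from the example; the function names are assumptions, and the 140 words-per-minute speaking rate is simply the value that yields an 8,400-word transcript for 60 minutes of audio.

```python
def per_minute_cost(audio_minutes: float, rate_per_minute: float) -> float:
    """Cost when billed per minute of audio."""
    return audio_minutes * rate_per_minute


def per_word_cost(
    audio_minutes: float, words_per_minute: float, rate_per_word: float
) -> float:
    """Cost when billed per transcript word, estimated from speaking rate."""
    estimated_words = audio_minutes * words_per_minute
    return estimated_words * rate_per_word


minutes = 60
by_minute = per_minute_cost(minutes, rate_per_minute=1.25)
by_word = per_word_cost(minutes, words_per_minute=140, rate_per_word=0.012)

print(f"Per-minute billing: ${by_minute:.2f}")
print(f"Per-word billing:   ${by_word:.2f}")
```

Because the per-word total scales with speaking rate, fast or dense conversations cost more under per-word billing than slow ones of the same duration.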
Costs may increase due to several factors: poor audio quality, many speakers, faster turnaround requests, extensive timestamping, or verbatim transcription. These conditions increase processing time and typically add $0.10–$0.75 per audio minute.
Compared with in-house transcription, professional services are often cheaper. Staff typically need 4–6 hours to transcribe one hour of audio, making the internal cost higher than outsourcing once transcription exceeds a few hours per month.
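A rough way to test this for your own team is to compare staff time against an outsourced rate. In the Python sketch below, the 4–6 hours of staff time per audio hour comes from the estimate above, and $1.25 per minute is the hybrid rate used in the earlier example. The $30 hourly staff cost is a purely hypothetical figure; substitute your own fully loaded cost.

```python
def in_house_cost(
    audio_hours: float,
    staff_hours_per_audio_hour: float,
    hourly_staff_cost: float,
) -> float:
    """Staff cost of transcribing internally."""
    return audio_hours * staff_hours_per_audio_hour * hourly_staff_cost


def outsourced_cost(audio_hours: float, rate_per_audio_minute: float) -> float:
    """Cost of commissioning the same audio at a per-minute rate."""
    return audio_hours * 60 * rate_per_audio_minute


HOURLY_STAFF_COST = 30.0  # hypothetical figure, not taken from this guide

low = in_house_cost(1, staff_hours_per_audio_hour=4, hourly_staff_cost=HOURLY_STAFF_COST)
high = in_house_cost(1, staff_hours_per_audio_hour=6, hourly_staff_cost=HOURLY_STAFF_COST)
external = outsourced_cost(1, rate_per_audio_minute=1.25)

print(f"In-house:   ${low:.2f} to ${high:.2f} per audio hour")
print(f"Outsourced: ${external:.2f} per audio hour")
```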
Transcription Turnaround Times: Standard TAT, Rush Delivery, and SLA Expectations for B2B Teams
Turnaround time (TAT) is a critical factor for B2B teams managing interviews, research, or compliance documentation.
| Service type | TAT for 1 hour of audio | TAT for 5 hours of audio |
| AI-only (automated) | <5 minutes | <15 minutes |
| Hybrid (ASR + human review) | 4–8 hours | 24–48 hours |
| Human professional | 24–48 hours | 3–5 business days |
| Human specialist (medical/legal) | 24–72 hours | 5–10 business days |
Rush options are available but limited by human review time. A 4-hour turnaround is usually possible only for recordings under one hour. Same-day delivery is common for AI-only workflows and smaller human-reviewed jobs submitted early in the day.
A reliable one-hour turnaround for human-reviewed transcription is generally unrealistic because of the time required for review and QA.
For enterprise contracts, B2B buyers should include service-level agreements (SLAs) specifying minimum accuracy targets (often 98–99%), guaranteed turnaround times, revision policies, and escalation procedures for urgent requests.
Transcription Pricing by Industry: Medical, Legal, Market Research, and Media Cost Benchmarks
Specialist transcription services typically cost more than general business transcription because they require domain expertise and higher accuracy thresholds.
Medical transcription is often priced per line rather than per minute, typically $0.07–$0.14 per line for human-reviewed output and $0.04–$0.08 per line for ASR-assisted workflows. Healthcare systems integrated with electronic health records may use platform-specific pricing structures.
Legal transcription generally ranges from $1.50–$4.00 per audio minute, depending on complexity and whether the transcript must be verbatim or certified for legal use.
Market research transcription, such as focus groups or qualitative interviews, usually costs $1.00–$2.00 per minute, often with speaker labels and rapid turnaround to support analysis.
Media and podcast transcription is slightly cheaper, typically $0.75–$1.50 per minute, while AI-generated drafts for show notes or SEO content can cost as little as $0.10–$0.25 per minute. However, broadcast compliance transcripts still require human review to meet regulatory standards.
Transcription Quality Standards and Confidentiality: What B2B Buyers Should Require from a Provider
Quality and confidentiality are the two factors that separate professional transcription services from consumer speech-to-text tools.
For B2B teams handling sensitive audio or regulated documentation, both must be clearly defined in procurement requirements and service agreements.
Transcription Accuracy Standards: WER Benchmarks, Industry Minimums, and How Providers Measure Quality
Transcription accuracy varies depending on audio quality, subject matter complexity, and the service tier used. B2B buyers should always define a target accuracy rate and confirm how a provider measures it.
| Standard | Accuracy rate | Context |
| Consumer AI tools (Otter, Whisper auto) | 85–92% | Internal notes and draft transcripts |
| Professional hybrid (ASR + human review) | 97–99% | Business meetings, interviews, media |
| Human specialist transcription | 98–99.5% | Legal, medical, technical content |
| AHDI medical standard | ≥98.5% | Healthcare documentation |
| NCRA court reporting standard | 95%+ real-time; 98%+ certified | Legal proceedings |
Accuracy is typically measured using Word Error Rate (WER):
WER = (Substitutions + Deletions + Insertions) ÷ Reference Words × 100
A transcript with 99% accuracy (1% WER) contains roughly 10 errors per 1,000 words.
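One way to check a delivered transcript against an accuracy target is to count the errors found in a QA sample and convert that count into an accuracy percentage. The Python sketch below does this; the function names and the sample figures are assumptions for illustration, and a real SLA would also define how the sample is selected and what counts as an error.

```python
def accuracy_percent(error_count: int, reference_words: int) -> float:
    """Accuracy as a percentage: 100 minus the word error rate."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return 100 * (reference_words - error_count) / reference_words


def meets_sla(
    error_count: int, reference_words: int, target_accuracy_percent: float
) -> bool:
    """True if the sampled accuracy reaches the agreed target."""
    return accuracy_percent(error_count, reference_words) >= target_accuracy_percent


# Hypothetical QA samples of 1,000 words each.
print(accuracy_percent(10, 1000))      # 10 errors in 1,000 words
print(meets_sla(10, 1000, 99.0))
print(meets_sla(25, 1000, 98.0))
```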
When evaluating providers, B2B buyers should ask:
• What accuracy SLA is guaranteed?
• How is accuracy measured (spot checks vs full QA)?
• What happens if accuracy targets are missed?
• Are QA reviewers independent from the original transcriptionist?
• How are complex audio files escalated for additional review?
For specialist transcription, buyers should also verify relevant credentials such as AHDI-trained medical transcriptionists, legal formatting familiarity, or ISO 9001 quality management certification for enterprise suppliers.
Transcription Confidentiality: NDA, Data Security, HIPAA, and What to Require for Sensitive Audio Content
Many recordings sent for transcription contain confidential business, legal, medical, or financial information, making data protection a critical procurement requirement.
| Content type | Required protections |
| Internal business meetings | NDA, encrypted transfer (TLS 1.2+), secure portal access |
| Legal recordings | NDA, encrypted storage, access logs, secure deletion policy |
| Medical recordings | HIPAA Business Associate Agreement, AES-256 encryption, compliant data handling |
| Financial services calls | NDA, encryption, regulatory record-keeping compliance |
| Research interviews | GDPR compliance, anonymisation options, consent documentation |
Confidentiality agreements should apply to all individuals handling the audio, including freelance transcriptionists or subcontractors. Contracts should also prohibit using submitted audio for AI model training, benchmarking, or product development.
For healthcare organisations in the United States, transcription providers handling patient recordings must sign a HIPAA Business Associate Agreement (BAA). Using consumer transcription tools without a BAA can constitute a HIPAA violation, even if no data breach occurs.
Evaluating a Transcription Service Provider: 7 Questions B2B Buyers Should Ask Before Commissioning
Before commissioning transcription services, B2B buyers should verify that the provider meets professional standards for accuracy, security, and workflow transparency.
Key questions include:
- What accuracy SLA do you guarantee, and how is it measured?
- Is your service AI-only, hybrid (AI + human review), or fully human transcription?
- Does a separate QA editor review transcripts independently?
- Do you assign domain-experienced transcriptionists for specialist content?
- What compliance standards do you follow (HIPAA, GDPR, ISO 27001, SOC 2)?
- What is your revision or correction policy if accuracy requirements are not met?
- How are subcontractors governed — do they sign individual NDAs and follow the same security protocols?
These questions help organisations distinguish between consumer transcription platforms and professional B2B transcription providers, ensuring transcripts meet required accuracy and confidentiality standards.
Professional Transcription Services with Guaranteed Accuracy, Specialist Domain Coverage, and B2B Confidentiality Standards

Circle Translations delivers professional transcription for B2B teams that need speed, accuracy, and confidentiality — not consumer-grade auto-captions.
Every transcription engagement includes:
✓ Hybrid workflow: ASR + professional human review — 98–99%+ accuracy, 24–48 hour standard turnaround
✓ Specialist domain coverage: legal, medical, technical, market research, and media content
✓ Verbatim or intelligent verbatim transcription — confirmed at project start
✓ Speaker identification and timestamps formatted to your specification
✓ Signed NDA for all projects; HIPAA BAA available; GDPR-compliant data handling
✓ Multiple output formats: DOCX, PDF, TXT, SRT, VTT
✓ Audio translation option: transcription + translation from the same source file
✓ Secure upload and delivery: encrypted transfer and controlled access
Ready to Get Your Audio Transcribed?
Circle Translations provides accurate, reliable transcription for B2B teams — with flexible pricing models, guaranteed quality standards, and fast turnaround across every deliverable format.
Frequently Asked Questions – What Is a Transcription Service
Who needs transcription services — which industries and job roles use them most?
Transcription services are used across legal, healthcare, research, media, corporate, and academic sectors. Common use cases include depositions, clinical dictation, research interviews, podcast transcripts, meeting records, and lecture documentation. Typical buyers include legal assistants, research managers, media producers, L&D teams, clinical documentation staff, and executive assistants managing recorded content.
What is the difference between a transcript and a transcription?
Transcription is the process of converting audio or video into written text. A transcript is the final written document produced from that process. In simple terms, transcription refers to the service, while the transcript is the deliverable provided to the client.
Can ChatGPT or other AI tools transcribe audio accurately enough for professional use?
AI transcription tools such as Whisper, Google Speech-to-Text, and AWS Transcribe typically achieve 80–92% accuracy on clean audio. This is suitable for draft notes or internal reference. Professional use cases — such as legal, medical, or published content — usually require 98–99% accuracy, meaning AI transcripts must be reviewed and corrected by a human editor.
Will AI replace professional transcriptionists?
AI is replacing the typing stage of transcription, but not the review and quality assurance stage. Most professional workflows now use AI-generated drafts followed by human editing. Human reviewers correct terminology, speaker labels, and formatting, ensuring transcripts meet professional accuracy standards.
How is transcription different from audio translation?
Transcription converts spoken audio into text in the same language. Audio translation converts spoken audio into text in a different language. Audio translation usually combines two steps: transcribing the original audio and then translating the transcript into the target language.
How long does it take to transcribe one hour of audio?
AI-only transcription can generate a draft transcript in under five minutes. Professional hybrid transcription (AI + human review) typically takes 4–8 hours, while fully manual transcription can require 24–48 hours, depending on audio complexity.
What audio file formats do transcription services accept?
Most transcription services accept common audio formats, including MP3, WAV, M4A, FLAC, AAC, OGG, WMA, and AIFF, as well as video formats such as MP4, MOV, MKV, AVI, and WebM. Providers often also accept links to YouTube, Vimeo, or cloud storage files.
How do I know if my audio quality is good enough for professional transcription?
Audio is suitable for transcription if speech is clear, background noise is minimal, and speakers do not overlap frequently. Poor audio quality can lead to [inaudible] sections or lower accuracy. Using noise-reduction tools before submission can improve results.
Can a transcription service also translate the transcript into another language?
Yes. Many providers offer audio translation, where the audio is transcribed in the original language and then translated into another language in the same workflow. This is commonly used for multilingual interviews, international research, and cross-border business communications.