Model Specifications
Detailed technical specifications for all transcription models supported by FloWords.
Whisper Models Overview
Whisper is OpenAI’s automatic speech recognition (ASR) system. FloWords runs it through whisper.cpp, a C/C++ implementation optimized for Apple Silicon.
Model Comparison
| Model | Parameters | Size | VRAM | English WER | Speed Factor |
|---|---|---|---|---|---|
| Tiny | 39M | 75 MB | ~1 GB | 8.4% | ~32x |
| Tiny.en | 39M | 75 MB | ~1 GB | 7.5% | ~32x |
| Base | 74M | 142 MB | ~1.5 GB | 5.0% | ~16x |
| Base.en | 74M | 142 MB | ~1.5 GB | 4.3% | ~16x |
| Small | 244M | 466 MB | ~2 GB | 3.4% | ~6x |
| Small.en | 244M | 466 MB | ~2 GB | 3.0% | ~6x |
| Medium | 769M | 1.5 GB | ~5 GB | 2.5% | ~2x |
| Medium.en | 769M | 1.5 GB | ~5 GB | 2.1% | ~2x |
| Large-v3 | 1550M | 3 GB | ~10 GB | 2.0% | ~1x |
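The English WER column reports word error rate: the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words. The sketch below shows that calculation under simple assumptions. The function name `word_error_rate` and the whitespace tokenization are illustrative only, and published benchmarks usually normalize case and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For instance, one substituted word and one dropped word against a six-word reference give 2/6, or about 33%.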
Whisper Model Details
Tiny
Name: Whisper Tiny
Parameters: 39 million
File Size: 75 MB
Memory Usage: ~1 GB
Layers: 4 encoder, 4 decoder
Dimension: 384
Heads: 6
Best for:
- Quick notes
- Low-memory systems
- Fastest transcription
Trade-offs:
- Lower accuracy
- May struggle with accents
- Limited noise handling
Base
Name: Whisper Base
Parameters: 74 million
File Size: 142 MB
Memory Usage: ~1.5 GB
Layers: 6 encoder, 6 decoder
Dimension: 512
Heads: 8
Best for:
- General daily use
- Good balance of speed and accuracy
- Most Mac configurations
Trade-offs:
- Moderate accuracy
- Some errors with technical terms
Small
Name: Whisper Small
Parameters: 244 million
File Size: 466 MB
Memory Usage: ~2 GB
Layers: 12 encoder, 12 decoder
Dimension: 768
Heads: 12
Best for:
- Better accuracy needs
- Professional work
- When speed isn’t critical
Trade-offs:
- Slower than Tiny/Base
- Higher memory requirements
Medium
Name: Whisper Medium
Parameters: 769 million
File Size: 1.5 GB
Memory Usage: ~5 GB
Layers: 24 encoder, 24 decoder
Dimension: 1024
Heads: 16
Best for:
- Professional transcription
- Challenging audio conditions
- Accented speech
Trade-offs:
- Significant memory usage
- Slower processing
- Requires 8 GB+ RAM
Large-v3
Name: Whisper Large v3
Parameters: 1550 million
File Size: 3 GB
Memory Usage: ~10 GB
Layers: 32 encoder, 32 decoder
Dimension: 1280
Heads: 20
Best for:
- Maximum accuracy
- Difficult audio
- Professional production
Trade-offs:
- Very high memory usage
- Slowest processing
- Requires 16 GB+ RAM
English-Only Models
Models ending in .en are optimized for English only:
| Model | Multilingual | English-Only |
|---|---|---|
| Tiny | tiny | tiny.en |
| Base | base | base.en |
| Small | small | small.en |
| Medium | medium | medium.en |
| Large | large-v3 | (No .en variant) |
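The naming rule in this table is mechanical: append `.en` to the multilingual name, except for large-v3, which has no English-only counterpart. A minimal sketch of that lookup follows. The helper name `english_only_variant` is hypothetical and not part of FloWords or whisper.cpp.

```python
from typing import Optional

# Multilingual model names that have an English-only counterpart,
# taken from the table above. large-v3 has no .en variant.
MODELS_WITH_EN_VARIANT = {"tiny", "base", "small", "medium"}


def english_only_variant(model: str) -> Optional[str]:
    """Return the .en model name for a multilingual model, or None if there is none."""
    name = model.strip().lower()
    if name in MODELS_WITH_EN_VARIANT:
        return f"{name}.en"
    return None
```

Returning `None` rather than raising lets a caller fall back to the multilingual model when no English-only variant exists.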
Advantages of .en Models
- Faster processing - No language detection
- Slightly better accuracy - Optimized for English
- Lower resource usage - Smaller effective vocabulary
Parakeet Models
FloWords runs NVIDIA’s Parakeet models through the FluidAudio framework.
Parakeet RNNT
Name: Parakeet RNNT
Architecture: RNN-Transducer
Focus: Real-time English ASR
Latency: Very low
Streaming: Yes
Best for:
- Real-time transcription
- Live dictation
- English content
Parakeet vs Whisper
| Aspect | Whisper | Parakeet |
|---|---|---|
| Languages | 99+ | English focus |
| Accuracy | Higher | Good |
| Latency | Higher | Lower |
| Streaming | Limited | Native |
| Memory | Higher | Lower |
Model Performance by Hardware
Apple Silicon
| Model | Recommended | Performance |
|---|---|---|
| Tiny | ✓ Excellent | ~30x real-time |
| Base | ✓ Excellent | ~15x real-time |
| Small | ✓ Good | ~5x real-time |
| Medium | ⚠️ Usable | ~2x real-time |
| Large | ⚠️ Slow | ~0.5x real-time |
| Model | Recommended | Performance |
|---|---|---|
| Tiny | ✓ Excellent | ~40x real-time |
| Base | ✓ Excellent | ~20x real-time |
| Small | ✓ Excellent | ~8x real-time |
| Medium | ✓ Good | ~3x real-time |
| Large | ✓ Usable | ~1x real-time |
| Model | Recommended | Performance |
|---|---|---|
| Tiny | ✓ Excellent | ~45x real-time |
| Base | ✓ Excellent | ~22x real-time |
| Small | ✓ Excellent | ~9x real-time |
| Medium | ✓ Good | ~4x real-time |
| Large | ✓ Usable | ~1.2x real-time |
| Model | Recommended | Performance |
|---|---|---|
| Tiny | ✓ Excellent | ~50x real-time |
| Base | ✓ Excellent | ~25x real-time |
| Small | ✓ Excellent | ~12x real-time |
| Medium | ✓ Excellent | ~5x real-time |
| Large | ✓ Good | ~2x real-time |
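A real-time factor of Nx means roughly N seconds of audio are processed per second of compute, so expected processing time is the audio duration divided by the factor. The sketch below shows that arithmetic. The function name is illustrative, and the factors in the tables are approximate, so the result is only an estimate.

```python
def estimated_processing_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimate transcription time from audio length and a real-time factor."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must not be negative")
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor
```

A 10-minute recording (600 seconds) at ~5x real-time takes about 120 seconds, while the same recording at ~0.5x real-time takes about 1200 seconds, twice as long as the audio itself.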
Intel Macs
| Model | 8 GB RAM | 16 GB RAM | 32 GB+ RAM |
|---|---|---|---|
| Tiny | ✓ Good | ✓ Good | ✓ Good |
| Base | ✓ Usable | ✓ Good | ✓ Good |
| Small | ⚠️ Slow | ✓ Usable | ✓ Good |
| Medium | ❌ Not recommended | ⚠️ Slow | ✓ Usable |
| Large | ❌ Not recommended | ❌ Not recommended | ⚠️ Slow |
Language Support
Tier 1: Best Support (WER < 5%)
English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Japanese, Chinese, Korean
Tier 2: Good Support (WER 5-15%)
Arabic, Czech, Danish, Finnish, Greek, Hebrew, Hindi, Hungarian, Indonesian, Norwegian, Polish, Romanian, Swedish, Thai, Turkish, Ukrainian, Vietnamese
Tier 3: Basic Support (WER > 15%)
All other 70+ languages supported by Whisper
Audio Specifications
Input Requirements
| Specification | Value |
|---|---|
| Sample Rate | 16000 Hz |
| Bit Depth | 16-bit |
| Channels | Mono |
| Format | PCM |
FloWords automatically converts audio to these specifications.
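FloWords handles this conversion itself, and its internal converter is not documented here. To check or prepare a file outside the app, the following sketch may help. It uses Python's standard `wave` module to test whether a WAV file already matches the table, and builds an ffmpeg command line as one common way to resample. Both function names are hypothetical, and ffmpeg is a stand-in tool here, not necessarily what FloWords uses.

```python
import wave

TARGET_SAMPLE_RATE = 16000  # Hz
TARGET_SAMPLE_WIDTH = 2     # bytes per sample (16-bit)
TARGET_CHANNELS = 1         # mono


def meets_input_spec(wav_path: str) -> bool:
    """Return True if a WAV file is already 16 kHz, 16-bit, mono PCM."""
    # The wave module reads uncompressed PCM WAV files and raises
    # wave.Error for other inputs.
    with wave.open(wav_path, "rb") as wav:
        return (
            wav.getframerate() == TARGET_SAMPLE_RATE
            and wav.getsampwidth() == TARGET_SAMPLE_WIDTH
            and wav.getnchannels() == TARGET_CHANNELS
        )


def ffmpeg_convert_args(source: str, destination: str) -> list:
    """Build an ffmpeg command line that converts to the target format."""
    return [
        "ffmpeg", "-i", source,
        "-ar", str(TARGET_SAMPLE_RATE),  # sample rate
        "-ac", str(TARGET_CHANNELS),     # channel count
        "-c:a", "pcm_s16le",             # 16-bit little-endian PCM
        destination,
    ]
```

The argument list can be passed to `subprocess.run` on a machine where ffmpeg is installed.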
Supported Input Formats
| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | Native support |
| MP3 | .mp3 | Converted to WAV |
| M4A | .m4a | Converted to WAV |
| AAC | .aac | Converted to WAV |
| FLAC | .flac | Converted to WAV |
| AIFF | .aiff | Converted to WAV |
| CAF | .caf | Converted to WAV |
| MP4 | .mp4 | Audio extracted |
| MOV | .mov | Audio extracted |
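The table above amounts to a lookup from file extension to one of three handling paths. A minimal sketch of that lookup follows. The constant and function names are hypothetical and only mirror the Notes column.

```python
import os
from typing import Optional

NATIVE = "native support"
CONVERTED = "converted to WAV"
EXTRACTED = "audio extracted"

# Extension-to-handling map, copied from the Supported Input Formats table.
HANDLING_BY_EXTENSION = {
    ".wav": NATIVE,
    ".mp3": CONVERTED,
    ".m4a": CONVERTED,
    ".aac": CONVERTED,
    ".flac": CONVERTED,
    ".aiff": CONVERTED,
    ".caf": CONVERTED,
    ".mp4": EXTRACTED,
    ".mov": EXTRACTED,
}


def input_handling(filename: str) -> Optional[str]:
    """Return how a file would be handled, or None if the extension is not listed."""
    extension = os.path.splitext(filename)[1].lower()
    return HANDLING_BY_EXTENSION.get(extension)
```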
Model Selection Guide
By Use Case
| Use Case | Recommended Model |
|---|---|
| Quick notes | Tiny |
| Daily use | Base |
| Documents | Small |
| Professional | Medium |
| Maximum accuracy | Large-v3 |
By RAM Available
| RAM | Maximum Model |
|---|---|
| 4 GB | Base |
| 8 GB | Small |
| 16 GB | Medium |
| 32 GB+ | Large-v3 |
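The RAM table can be read as a set of thresholds: pick the row with the largest RAM value that does not exceed what the machine has. The sketch below encodes that reading. The function name is hypothetical, the treatment of in-between values (for example 12 GB mapping to the 8 GB row) is an assumption, and the table gives no recommendation below 4 GB, so the function returns `None` there.

```python
from typing import Optional

# (minimum RAM in GB, largest recommended model), from the table above,
# ordered from most to least RAM.
RAM_TIERS = [
    (32, "Large-v3"),
    (16, "Medium"),
    (8, "Small"),
    (4, "Base"),
]


def max_model_for_ram(ram_gb: float) -> Optional[str]:
    """Return the largest recommended model for the given RAM, or None below 4 GB."""
    for minimum_gb, model in RAM_TIERS:
        if ram_gb >= minimum_gb:
            return model
    return None
```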
By Content Type
| Content | Recommended |
|---|---|
| Clear speech | Any |
| Background noise | Medium+ |
| Technical terms | Medium+ with dictionary |
| Multiple speakers | Medium+ |
| Accented speech | Medium+ |
Advanced Configuration
Model Parameters
| Parameter | Range | Default | Effect |
|---|---|---|---|
| beam_size | 1-10 | 5 | Accuracy vs speed |
| best_of | 1-5 | 1 | Candidates considered |
| temperature | 0.0-1.0 | 0.0 | Prediction randomness |
| patience | 0.0-2.0 | 1.0 | Early stopping |
| length_penalty | 0.0-2.0 | 1.0 | Length bias |
When to Adjust
| Goal | Adjustment |
|---|---|
| Faster | Lower beam_size |
| More accurate | Higher beam_size, best_of |
| More variety | Higher temperature |
| Shorter outputs | Lower length_penalty |
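The ranges and defaults in the Model Parameters table can be captured in a small validated settings object, which makes the adjustments above easy to express and hard to get wrong. This is a sketch only. `DecodingParams` is a hypothetical container, not a FloWords or whisper.cpp API, and how FloWords actually exposes these settings is not specified here.

```python
from dataclasses import dataclass

# (low, high) bounds from the Model Parameters table.
PARAMETER_RANGES = {
    "beam_size": (1, 10),
    "best_of": (1, 5),
    "temperature": (0.0, 1.0),
    "patience": (0.0, 2.0),
    "length_penalty": (0.0, 2.0),
}


@dataclass(frozen=True)
class DecodingParams:
    """Hypothetical settings container; defaults match the table's Default column."""

    beam_size: int = 5
    best_of: int = 1
    temperature: float = 0.0
    patience: float = 1.0
    length_penalty: float = 1.0

    def __post_init__(self) -> None:
        # Reject any value outside the documented range.
        for name, (low, high) in PARAMETER_RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(
                    f"{name}={value} is outside the documented range {low}-{high}"
                )


# Presets following the "When to Adjust" table.
FASTER = DecodingParams(beam_size=1)
MORE_ACCURATE = DecodingParams(beam_size=10, best_of=5)
```

Validating at construction time means an out-of-range value such as `beam_size=11` fails immediately instead of surfacing later during transcription.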
Model Updates
Checking for Updates
FloWords checks for model updates automatically. To check manually:
- Open Settings > Model
- Click Check for Updates
Update Process
Model updates may include:
- Accuracy improvements
- New language support
- Performance optimizations
- Bug fixes
Next Steps
- Download Models to get started
- Configure Settings for your needs
- Review Best Practices for optimal use