Raw Power

Unconstrained performance on the world's most advanced consumer silicon.


Key Metrics

Metric | Value
Largest LLM | 397B params (~189 GB)
Memory Bandwidth | 819 GB/s
Unified Memory | 512 GB
Search Providers | 3

Speed vs Cloud

Voice Response Latency

Kaizen: ~400ms vs Cloud Average: ~1500ms

Privacy & Security

Kaizen: 100% vs Cloud: Variable

Context Window Retention

Kaizen: Unlimited (RAG + Mem0)


Model Performance

Model | Parameters | Active | GGUF Size | Speed | Use Case
max:voice | 35B (MoE) | 3B | ~23GB (Q4_K_M) | 42.9 tok/s | Voice, fast responses
max:deep | 397B (MoE) | 17B | ~189GB (Q3_K) | 17.6 tok/s | Deep reasoning
max:think | 397B (MoE) | 17B | ~189GB (Q3_K) | 17.5 tok/s | Extended thinking

Mixture-of-Experts: only 17B of 397B parameters active per token — datacenter-scale intelligence at conversational speed on consumer silicon.
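Why MoE makes this feasible can be shown with a back-of-envelope calculation: decode speed on a single node is roughly memory-bandwidth-bound, so what matters is how many weight bytes must be read per token. The numbers below come from the tables above; the ~3.4 bits/weight figure for Q3_K is an assumption, and the sketch ignores attention, KV-cache traffic, and routing overhead (which is why the measured 17.6 tok/s sits well under the ceiling).

```python
# Back-of-envelope: MoE decode throughput estimate (illustrative, not a benchmark).
BANDWIDTH_GBPS = 819       # M3 Ultra unified memory bandwidth (GB/s)
ACTIVE_PARAMS = 17e9       # active parameters per token (max:deep)
TOTAL_PARAMS = 397e9
BITS_PER_WEIGHT = 3.4      # assumed average for Q3_K quantization

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
ceiling_toks = BANDWIDTH_GBPS * 1e9 / bytes_per_token

print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")   # ~4.3%
print(f"Weight bytes read per token: {bytes_per_token / 1e9:.1f} GB")     # ~7.2 GB
print(f"Bandwidth-bound ceiling: {ceiling_toks:.0f} tok/s (measured: 17.6)")
```

A dense 397B model at the same quantization would need ~169 GB of weight reads per token, putting it far below 1 tok/s on the same hardware.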


System Benchmarks

Hardware

  • CPU: 32-core (24 Performance + 8 Efficiency)
  • GPU: 80-core Metal Optimized
  • Neural Engine: 32-core
  • Unified Memory: 512 GB (819 GB/s bandwidth)
  • Storage: 8TB internal + 40TB NAS (10GbE)

Software Pipeline

  • Inference: Ollama v0.17.5 (qwen35moe architecture, native Metal)
  • Voice STT: Whisper Large v3 Turbo (MLX, ~200-400ms)
  • Voice TTS: Kokoro (Neural) + Piper (Fallback), 24kHz output
  • Vector DB: Mem0 + ChromaDB (mxbai-embed-large, 1024-dim)
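At its core, the memory retrieval step is a nearest-neighbor search over embedding vectors. A minimal pure-Python sketch of that idea (the real stack uses Mem0 + ChromaDB with 1024-dim mxbai-embed-large vectors; the toy 4-dim vectors and memory strings here are invented for readability):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=3):
    """Rank stored (text, vector) memories by similarity to the query."""
    ranked = sorted(store, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 4-dim vectors stand in for real 1024-dim embeddings.
store = [
    ("User prefers concise answers",  [0.9, 0.1, 0.0, 0.1]),
    ("User's NAS is 40TB over 10GbE", [0.1, 0.9, 0.2, 0.0]),
    ("User likes jazz",               [0.0, 0.2, 0.9, 0.1]),
]
print(top_k([0.8, 0.2, 0.1, 0.0], store, k=1))
```

ChromaDB performs the same ranking with an approximate index instead of a full sort, which keeps lookups fast as the memory store grows.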

Services Architecture

Service | Port | Role | Status
Ollama | 11434 | LLM Inference (v0.17.5, Qwen3.5 MoE) | Active
WebSearch Proxy | 11435 | Search, Memory, Context Injection (v1.3.0) | Active
Orchestrator | 11440 | Dashboard API, Memory Proxy, Service Control | Active
Whisper STT | 8002 | MLX Speech-to-Text | Active
Hybrid TTS | 8003 | Kokoro/Piper Voice Synthesis | Active
Memory Service | 8100 | Mem0 + ChromaDB Personal Memory | Active
OpenWebUI | 8080 | Web Chat Interface (v0.8.1) | Active
Z.AI Proxy | 5001 | GLM-5, GLM-4.7, GLM-4.6 Cloud Models | Active
Claude Proxy | 5002 | Claude Opus 4.6, Sonnet 4.6, Haiku 4.5 | Active
Codex Proxy | 5003 | GPT-5.2 Codex, GPT-5.1 Codex Max | Active
Caddy | 8443 | HTTPS Reverse Proxy | Active
Cloudflare Tunnel | n/a | Secure External Access | Active
Glances | 61208 | System Monitoring | Active
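Since every local service is bound to a known port, liveness can be verified with a plain TCP probe. A minimal sketch (the host/port map mirrors the table above; the actual health monitoring lives in kaizen.sh, and this is just an illustration of the idea):

```python
import socket

SERVICES = {
    "Ollama": 11434,
    "WebSearch Proxy": 11435,
    "Orchestrator": 11440,
    "Whisper STT": 8002,
    "Hybrid TTS": 8003,
    "Memory Service": 8100,
    "OpenWebUI": 8080,
}

def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in SERVICES.items():
    status = "up" if port_open("127.0.0.1", port) else "down"
    print(f"{name:<16} :{port}  {status}")
```

A TCP probe only proves the listener exists; a stricter check would hit each service's HTTP health endpoint instead.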

Innovations

Mixture-of-Experts Efficiency

Qwen3.5-397B uses only 17B of 397B parameters per token, enabling a datacenter-scale model to run at conversational speed on consumer hardware. The full 189GB model fits entirely in the M3 Ultra's 512GB unified memory.

Intelligent Context Injection

3-step context chain: identity enforcement, conditional hardware specs (only when asked), and personal memory from Mem0. The model never volunteers hardware details unprompted — Response Style Rules enforce natural conversation.
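The chain can be sketched as a small prompt-assembly function. This is an illustration only: the trigger keywords, prompt strings, and function name are assumptions, not the production rules.

```python
IDENTITY = "You are Kaizen, a local AI assistant. Respond conversationally."
HARDWARE = "Host: Mac Studio M3 Ultra, 512GB unified memory, 80-core GPU."

def build_system_prompt(user_msg: str, memories: list[str]) -> str:
    """Assemble the 3-step context chain: identity enforcement,
    conditional hardware specs, then personal memories from the store."""
    parts = [IDENTITY]                                    # 1. identity, always
    if any(w in user_msg.lower() for w in ("hardware", "specs", "gpu", "memory")):
        parts.append(HARDWARE)                            # 2. only when asked
    if memories:
        parts.append("Relevant memories:\n- " + "\n- ".join(memories))  # 3. recall
    return "\n\n".join(parts)

print(build_system_prompt("What GPU are you running on?",
                          ["User prefers metric units"]))
```

Because the hardware block is gated on the user's wording, the model literally never sees those specs in its context unless they were requested.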

Smart Search Proxy

Intercepts LLM requests to inject real-time data from 3 search providers (SerpAPI, Brave, Google PSE), Knowledge Graph data, weather, and sports scores before the model sees the prompt. Standalone /v1/search API serves cloud AI proxies too.
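The interception pattern amounts to rewriting the chat request before it reaches the model: run the search, then splice the results in as a system message ahead of the user's turn. A minimal sketch under those assumptions (the request shape follows the OpenAI-style chat format; `search_fn` stands in for the SerpAPI/Brave/Google PSE fallback chain):

```python
def inject_search_context(request: dict, search_fn) -> dict:
    """Prepend fresh search results so the model sees real-time
    data before it ever reads the user's question."""
    user_msg = request["messages"][-1]["content"]
    results = search_fn(user_msg)
    if results:
        context = "Web results:\n" + "\n".join(f"- {r}" for r in results)
        # Insert just before the final (user) message.
        request["messages"].insert(-1, {"role": "system", "content": context})
    return request

# Stub search provider for demonstration.
fake_search = lambda q: ["Lakers 112, Celtics 109 (final)"] if "score" in q else []
req = {"messages": [{"role": "user", "content": "What was the Lakers score?"}]}
out = inject_search_context(req, fake_search)
print(out["messages"][0]["role"])
```

When the search returns nothing, the request passes through untouched, so ordinary conversation pays no latency or token cost.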

Hybrid Voice Pipeline

Custom-built pipeline with Kokoro neural TTS (quality) and Piper (speed/fallback). 10-stage text normalization strips markdown, code blocks, think tokens, and HTML before synthesis. 10 voices, 24kHz output.
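The normalization idea is a staged sequence of substitutions applied before text reaches the synthesizer. A sketch of a few representative stages (the exact rules and ordering of the 10-stage production pipeline are not shown here; these patterns are assumptions):

```python
import re

# Illustrative subset of normalization stages, applied in order.
STAGES = [
    (r"<think>.*?</think>", ""),    # strip model "think" tokens
    (r"```.*?```", ""),             # strip fenced code blocks
    (r"<[^>]+>", ""),               # strip HTML tags
    (r"\*\*([^*]+)\*\*", r"\1"),    # unwrap bold markdown
    (r"\s+", " "),                  # collapse whitespace
]

def normalize_for_tts(text: str) -> str:
    for pattern, repl in STAGES:
        text = re.sub(pattern, repl, text, flags=re.DOTALL)
    return text.strip()

print(normalize_for_tts(
    "<think>plan</think>**Hello!** Here's code: ```x=1``` <b>done</b>"
))
```

Ordering matters: code blocks must be removed before the HTML pass, or angle brackets inside code could be mangled instead of dropped wholesale.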

Zero-Docker Architecture

Running natively on macOS for maximum resource efficiency and direct Metal GPU access. All 13 services managed by kaizen.sh with health monitoring — no container overhead.


Single-Node Architecture

Powered by Ollama v0.17.5 running natively on macOS with Metal GPU acceleration.

Mac Studio M3 Ultra | 512GB Unified Memory | 80-Core GPU | 819 GB/s Bandwidth

max:voice 42.9 tok/s | max:deep 17.6 tok/s | max:think 17.5 tok/s

Native Metal inference with zero container overhead — the model has direct access to all 80 GPU cores and 512GB of unified memory.