Raw Power

Unconstrained performance on the world's most advanced consumer silicon.


Key Metrics

Metric | Value
Largest LLM | 397B params (~189 GB)
Memory Bandwidth | 819 GB/s
Unified Memory | 512 GB
Search Providers | 3

Speed vs Cloud

Voice Response Latency

Kaizen: ~400ms vs Cloud Average: ~1500ms

Privacy & Security

Kaizen: 100% vs Cloud: Variable

Context Window Retention

Kaizen: Unlimited (RAG + Mem0)


Model Performance

Model | Parameters | Active | GGUF Size | Speed | Use Case
max:voice | 35B (MoE) | 3B | ~23GB (Q4_K_M) | 42.9 tok/s | Voice, fast responses
max:deep | 397B (MoE) | 17B | ~189GB (Q3_K) | 17.6 tok/s | Deep reasoning
max:think | 397B (MoE) | 17B | ~189GB (Q3_K) | 17.5 tok/s | Extended thinking

Mixture-of-Experts: only 17B of 397B parameters active per token — datacenter-scale intelligence at conversational speed on consumer silicon.
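Why MoE makes this feasible can be shown with a back-of-envelope calculation: decode speed on a single node is roughly memory-bandwidth-bound, so what matters is how many weight bytes must be read per token. The numbers below come from the tables above; the ~3.4 bits/weight figure for Q3_K is an assumption, and the sketch ignores attention, KV-cache traffic, and routing overhead (which is why the measured 17.6 tok/s sits well under the ceiling).

```python
# Back-of-envelope: MoE decode throughput estimate (illustrative, not a benchmark).
BANDWIDTH_GBPS = 819       # M3 Ultra unified memory bandwidth (GB/s)
ACTIVE_PARAMS = 17e9       # active parameters per token (max:deep)
TOTAL_PARAMS = 397e9
BITS_PER_WEIGHT = 3.4      # assumed average for Q3_K quantization

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
ceiling_toks = BANDWIDTH_GBPS * 1e9 / bytes_per_token

print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")   # ~4.3%
print(f"Weight bytes read per token: {bytes_per_token / 1e9:.1f} GB")     # ~7.2 GB
print(f"Bandwidth-bound ceiling: {ceiling_toks:.0f} tok/s (measured: 17.6)")
```

A dense 397B model at the same quantization would need ~169 GB of weight reads per token, putting it far below 1 tok/s on the same hardware.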


System Benchmarks

Hardware

  • CPU: 32-core (24 Performance + 8 Efficiency)
  • GPU: 80-core Metal Optimized
  • Neural Engine: 32-core
  • Unified Memory: 512 GB (819 GB/s bandwidth)
  • Storage: 8TB internal + 40TB NAS (10GbE)

Software Pipeline

  • Inference: Ollama v0.17.5 (qwen35moe architecture, native Metal)
  • Voice STT: Whisper Large v3 Turbo (MLX, ~200-400ms)
  • Voice TTS: Kokoro (Neural) + Piper (Fallback), 24kHz output
  • Vector DB: Mem0 + ChromaDB (mxbai-embed-large, 1024-dim)
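At its core, the memory retrieval step is a nearest-neighbor search over embedding vectors. A minimal pure-Python sketch of that idea (the real stack uses Mem0 + ChromaDB with 1024-dim mxbai-embed-large vectors; the toy 4-dim vectors and memory strings here are invented for readability):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=3):
    """Rank stored (text, vector) memories by similarity to the query."""
    ranked = sorted(store, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 4-dim vectors stand in for real 1024-dim embeddings.
store = [
    ("User prefers concise answers",  [0.9, 0.1, 0.0, 0.1]),
    ("User's NAS is 40TB over 10GbE", [0.1, 0.9, 0.2, 0.0]),
    ("User likes jazz",               [0.0, 0.2, 0.9, 0.1]),
]
print(top_k([0.8, 0.2, 0.1, 0.0], store, k=1))
```

ChromaDB performs the same ranking with an approximate index instead of a full sort, which keeps lookups fast as the memory store grows.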

Services Architecture

Service | Port | Role | Status
Ollama | 11434 | LLM Inference (v0.17.5, Qwen3.5 MoE) | Active
WebSearch Proxy | 11435 | Search, Memory, Context Injection (v1.3.0) | Active
Orchestrator | 11440 | Dashboard API, Memory Proxy, Service Control | Active
Whisper STT | 8002 | MLX Speech-to-Text | Active
Hybrid TTS | 8003 | Kokoro/Piper Voice Synthesis | Active
Memory Service | 8100 | Mem0 + ChromaDB Personal Memory | Active
OpenWebUI | 8080 | Web Chat Interface (v0.8.1) | Active
Z.AI Proxy | 5001 | GLM-5, GLM-4.7, GLM-4.6 Cloud Models | Active
Claude Proxy | 5002 | Claude Opus 4.6, Sonnet 4.6, Haiku 4.5 | Active
Codex Proxy | 5003 | GPT-5.2 Codex, GPT-5.1 Codex Max | Active
Caddy | 8443 | HTTPS Reverse Proxy | Active
Cloudflare Tunnel | n/a | Secure External Access | Active
Glances | 61208 | System Monitoring | Active
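Since every local service is bound to a known port, liveness can be verified with a plain TCP probe. A minimal sketch (the host/port map mirrors the table above; the actual health monitoring lives in kaizen.sh, and this is just an illustration of the idea):

```python
import socket

SERVICES = {
    "Ollama": 11434,
    "WebSearch Proxy": 11435,
    "Orchestrator": 11440,
    "Whisper STT": 8002,
    "Hybrid TTS": 8003,
    "Memory Service": 8100,
    "OpenWebUI": 8080,
}

def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in SERVICES.items():
    status = "up" if port_open("127.0.0.1", port) else "down"
    print(f"{name:<16} :{port}  {status}")
```

A TCP probe only proves the listener exists; a stricter check would hit each service's HTTP health endpoint instead.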

Innovations

Mixture-of-Experts Efficiency

Qwen3.5-397B uses only 17B of 397B parameters per token, enabling a datacenter-scale model to run at conversational speed on consumer hardware. The full 189GB model fits entirely in the M3 Ultra's 512GB unified memory.

Intelligent Context Injection

3-step context chain: identity enforcement, conditional hardware specs (only when asked), and personal memory from Mem0. The model never volunteers hardware details unprompted — Response Style Rules enforce natural conversation.
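The chain can be sketched as a small prompt-assembly function. This is an illustration only: the trigger keywords, prompt strings, and function name are assumptions, not the production rules.

```python
IDENTITY = "You are Kaizen, a local AI assistant. Respond conversationally."
HARDWARE = "Host: Mac Studio M3 Ultra, 512GB unified memory, 80-core GPU."

def build_system_prompt(user_msg: str, memories: list[str]) -> str:
    """Assemble the 3-step context chain: identity enforcement,
    conditional hardware specs, then personal memories from the store."""
    parts = [IDENTITY]                                    # 1. identity, always
    if any(w in user_msg.lower() for w in ("hardware", "specs", "gpu", "memory")):
        parts.append(HARDWARE)                            # 2. only when asked
    if memories:
        parts.append("Relevant memories:\n- " + "\n- ".join(memories))  # 3. recall
    return "\n\n".join(parts)

print(build_system_prompt("What GPU are you running on?",
                          ["User prefers metric units"]))
```

Because the hardware block is gated on the user's wording, the model literally never sees those specs in its context unless they were requested.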

Smart Search Proxy

Intercepts LLM requests to inject real-time data from 3 search providers (SerpAPI, Brave, Google PSE), Knowledge Graph data, weather, and sports scores before the model sees the prompt. Standalone /v1/search API serves cloud AI proxies too.
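The interception pattern amounts to rewriting the chat request before it reaches the model: run the search, then splice the results in as a system message ahead of the user's turn. A minimal sketch under those assumptions (the request shape follows the OpenAI-style chat format; `search_fn` stands in for the SerpAPI/Brave/Google PSE fallback chain):

```python
def inject_search_context(request: dict, search_fn) -> dict:
    """Prepend fresh search results so the model sees real-time
    data before it ever reads the user's question."""
    user_msg = request["messages"][-1]["content"]
    results = search_fn(user_msg)
    if results:
        context = "Web results:\n" + "\n".join(f"- {r}" for r in results)
        # Insert just before the final (user) message.
        request["messages"].insert(-1, {"role": "system", "content": context})
    return request

# Stub search provider for demonstration.
fake_search = lambda q: ["Lakers 112, Celtics 109 (final)"] if "score" in q else []
req = {"messages": [{"role": "user", "content": "What was the Lakers score?"}]}
out = inject_search_context(req, fake_search)
print(out["messages"][0]["role"])
```

When the search returns nothing, the request passes through untouched, so ordinary conversation pays no latency or token cost.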

Hybrid Voice Pipeline

Custom-built pipeline with Kokoro neural TTS (quality) and Piper (speed/fallback). 10-stage text normalization strips markdown, code blocks, think tokens, and HTML before synthesis. 10 voices, 24kHz output.
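The normalization idea is a staged sequence of substitutions applied before text reaches the synthesizer. A sketch of a few representative stages (the exact rules and ordering of the 10-stage production pipeline are not shown here; these patterns are assumptions):

```python
import re

# Illustrative subset of normalization stages, applied in order.
STAGES = [
    (r"<think>.*?</think>", ""),    # strip model "think" tokens
    (r"```.*?```", ""),             # strip fenced code blocks
    (r"<[^>]+>", ""),               # strip HTML tags
    (r"\*\*([^*]+)\*\*", r"\1"),    # unwrap bold markdown
    (r"\s+", " "),                  # collapse whitespace
]

def normalize_for_tts(text: str) -> str:
    for pattern, repl in STAGES:
        text = re.sub(pattern, repl, text, flags=re.DOTALL)
    return text.strip()

print(normalize_for_tts(
    "<think>plan</think>**Hello!** Here's code: ```x=1``` <b>done</b>"
))
```

Ordering matters: code blocks must be removed before the HTML pass, or angle brackets inside code could be mangled instead of dropped wholesale.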

Zero-Docker Architecture

Running natively on macOS for maximum resource efficiency and direct Metal GPU access. All 13 services managed by kaizen.sh with health monitoring — no container overhead.


Single-Node Architecture

Powered by Ollama v0.17.5 running natively on macOS with Metal GPU acceleration.

Mac Studio M3 Ultra | 512GB Unified Memory | 80-Core GPU | 819 GB/s Bandwidth

max:voice 42.9 tok/s | max:deep 17.6 tok/s | max:think 17.5 tok/s

Native Metal inference with zero container overhead — the model has direct access to all 80 GPU cores and 512GB of unified memory.