Unconstrained performance on the world's most advanced consumer silicon.
| Metric | Value |
|---|---|
| Largest LLM | 397B params (~189 GB) |
| Memory Bandwidth | 819 GB/s |
| Unified Memory | 512 GB |
| Search Providers | 3 |
| Metric | Kaizen | Cloud |
|---|---|---|
| Latency | ~400 ms | ~1500 ms (average) |
| Uptime | 100% | Variable |
| Memory | Unlimited (RAG + Mem0) | — |
| Model | Parameters | Active | GGUF Size | Speed | Use Case |
|---|---|---|---|---|---|
| max:voice | 35B (MoE) | 3B | ~23GB (Q4_K_M) | 42.9 tok/s | Voice, fast responses |
| max:deep | 397B (MoE) | 17B | ~189GB (Q3_K) | 17.6 tok/s | Deep reasoning |
| max:think | 397B (MoE) | 17B | ~189GB (Q3_K) | 17.5 tok/s | Extended thinking |
Mixture-of-Experts: only 17B of 397B parameters active per token — datacenter-scale intelligence at conversational speed on consumer silicon.
| Service | Port | Role | Status |
|---|---|---|---|
| Ollama | 11434 | LLM Inference (v0.17.5, Qwen3.5 MoE) | Active |
| WebSearch Proxy | 11435 | Search, Memory, Context Injection (v1.3.0) | Active |
| Orchestrator | 11440 | Dashboard API, Memory Proxy, Service Control | Active |
| Whisper STT | 8002 | MLX Speech-to-Text | Active |
| Hybrid TTS | 8003 | Kokoro/Piper Voice Synthesis | Active |
| Memory Service | 8100 | Mem0 + ChromaDB Personal Memory | Active |
| OpenWebUI | 8080 | Web Chat Interface (v0.8.1) | Active |
| Z.AI Proxy | 5001 | GLM-5, GLM-4.7, GLM-4.6 Cloud Models | Active |
| Claude Proxy | 5002 | Claude Opus 4.6, Sonnet 4.6, Haiku 4.5 | Active |
| Codex Proxy | 5003 | GPT-5.2 Codex, GPT-5.1 Codex Max | Active |
| Caddy | 8443 | HTTPS Reverse Proxy | Active |
| Cloudflare Tunnel | — | Secure External Access | Active |
| Glances | 61208 | System Monitoring | Active |
Because only 17B of Qwen3.5's 397B parameters are read per token, decode speed is governed by the active expert set rather than the full model, while the entire ~189 GB quantized model still fits in the M3 Ultra's 512 GB of unified memory.
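The bandwidth arithmetic behind that claim can be sketched from the numbers quoted above. This is a back-of-envelope estimate only: on a unified-memory machine, decode speed is roughly memory bandwidth divided by bytes read per token.

```python
# Back-of-envelope decode-speed estimate from the figures quoted above.
total_params = 397e9        # Qwen3.5 total parameters
active_params = 17e9        # parameters read per token (MoE)
gguf_bytes = 189e9          # ~189 GB quantized model in memory
bandwidth = 819e9           # M3 Ultra memory bandwidth, bytes/s

bytes_per_param = gguf_bytes / total_params             # ~0.48 B/param (~3.8 bits)
moe_bytes_per_token = active_params * bytes_per_param   # ~8.1 GB read per token
moe_ceiling = bandwidth / moe_bytes_per_token           # ~101 tok/s theoretical cap

dense_ceiling = bandwidth / gguf_bytes                  # dense 397B: ~4.3 tok/s cap
```

The measured 17.6 tok/s sits well under the MoE ceiling but far above what a dense 397B model could ever reach at this bandwidth, which is the whole point of the architecture.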
3-step context chain: identity enforcement, conditional hardware specs (only when asked), and personal memory from Mem0. The model never volunteers hardware details unprompted — Response Style Rules enforce natural conversation.
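The chain described above might look like this as a sketch. Function and field names are illustrative, not the actual proxy code:

```python
def build_context(user_msg, memories, hardware_specs, wants_hardware):
    """Hypothetical 3-step context chain: identity -> conditional specs -> memory."""
    parts = ["You are Kaizen. Answer naturally; never volunteer hardware details."]
    if wants_hardware:                       # step 2: specs only when asked
        parts.append("Hardware: " + hardware_specs)
    if memories:                             # step 3: Mem0 recall
        parts.append("About the user: " + "; ".join(memories))
    return {"system": "\n".join(parts), "user": user_msg}
```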
Intercepts LLM requests to inject real-time data from 3 search providers (SerpAPI, Brave, Google PSE), Knowledge Graph data, weather, and sports scores before the model sees the prompt. Standalone /v1/search API serves cloud AI proxies too.
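The injection step can be illustrated with a minimal sketch. The message shape follows the common OpenAI-style chat format; the names here are assumptions, not the proxy's real code:

```python
def inject_live_context(messages, results):
    """Prepend fetched snippets as a system message before the LLM sees the prompt."""
    snippets = "\n".join(f"[{r['provider']}] {r['snippet']}" for r in results)
    context = {"role": "system", "content": "Live web context:\n" + snippets}
    return [context] + messages
```

Because the context arrives as an extra system message, the model answers with fresh data while the user's original prompt passes through untouched.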
Custom-built pipeline with Kokoro neural TTS (quality) and Piper (speed/fallback). 10-stage text normalization strips markdown, code blocks, think tokens, and HTML before synthesis. 10 voices, 24kHz output.
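A condensed sketch of that normalization idea, with a handful of illustrative stages rather than the actual 10-stage pipeline:

```python
import re

# Each stage is (pattern, replacement), applied in order.
STAGES = [
    (re.compile(r"<think>.*?</think>", re.S), ""),  # drop think tokens
    (re.compile(r"`{3}.*?`{3}", re.S), ""),         # drop fenced code blocks
    (re.compile(r"`([^`]*)`"), r"\1"),              # unwrap inline code
    (re.compile(r"<[^>]+>"), ""),                   # strip HTML tags
    (re.compile(r"[*_#>]+"), ""),                   # strip markdown markers
    (re.compile(r"\s+"), " "),                      # collapse whitespace
]

def normalize_for_tts(text):
    for pattern, repl in STAGES:
        text = pattern.sub(repl, text)
    return text.strip()
```

Stage order matters: HTML tags must be stripped before markdown markers, or the `>` in `<b>` would be eaten first and leave a dangling `<b` behind.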
Running natively on macOS for maximum resource efficiency and direct Metal GPU access. All 13 services managed by kaizen.sh with health monitoring — no container overhead.
Powered by Ollama v0.17.5 running natively on macOS with Metal GPU acceleration.
Mac Studio M3 Ultra | 512GB Unified Memory | 80-Core GPU | 819 GB/s Bandwidth
max:voice 42.9 tok/s | max:deep 17.6 tok/s | max:think 17.5 tok/s
Native Metal inference with zero container overhead — the model has direct access to all 80 GPU cores and 512GB of unified memory.