What I Learned Testing Four LLM Providers for My AI Assistant
The cost of running a full-time AI assistant snuck up on me. Not the server. The tokens.
I run two AI assistants on my server. They handle email, social media, course management, infrastructure monitoring. Around the clock. The conversation context for a persistent agent grows to hundreds of thousands of tokens.
Prompt caching is what makes that affordable. With Anthropic, cache reads cost roughly 12x less than cache writes. As long as your stable context hits the cache, each turn costs pennies.
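To make the arithmetic concrete, here is a rough sketch of one turn's input cost for a large stable context. The base price and the 200,000-token context are illustrative assumptions, not figures from my bill. The multipliers (1.25x base input for a 5-minute cache write, 0.1x for a cache read) are Anthropic's published ratios as I understand them, so check the current pricing page before relying on them.

```python
# Illustrative numbers only. BASE_INPUT_PER_MTOK and the context size are
# assumptions for the sake of the example, not quoted prices or real usage.
BASE_INPUT_PER_MTOK = 3.00     # USD per million input tokens (assumed)
CACHE_WRITE_MULTIPLIER = 1.25  # 5-minute cache write, relative to base input
CACHE_READ_MULTIPLIER = 0.10   # cache read, relative to base input


def turn_cost(context_tokens: int, multiplier: float) -> float:
    """Input cost in USD for one turn over `context_tokens` tokens."""
    return context_tokens / 1_000_000 * BASE_INPUT_PER_MTOK * multiplier


context = 200_000  # assumed size of the stable, cacheable context
write_cost = turn_cost(context, CACHE_WRITE_MULTIPLIER)
read_cost = turn_cost(context, CACHE_READ_MULTIPLIER)

print(f"cache write: ${write_cost:.2f} per turn")
print(f"cache read:  ${read_cost:.2f} per turn")
print(f"ratio:       {write_cost / read_cost:.1f}x")
```

Under these assumptions the same turn costs cents when it is read from cache and most of a dollar when it is rewritten, which is where the "roughly 12x" comes from.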
Mine stopped hitting.
Calls that should have been cache reads were being billed as full writes. A session that should have cost pennies was costing dollars. I tried Google’s Gemini 3.5 Flash. Its OpenAI-compatible endpoint wasn’t caching at all. Every call, full price. OpenAI’s GPT models worked but had their own pricing curves.
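One way to spot this kind of regression is to look at the usage block that comes back with each response. The sketch below assumes the Anthropic Messages API field names (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`). The helper names, the 0.8 threshold, and the sample numbers are hypothetical, made up for illustration rather than taken from real logs.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of this turn's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0


def classify_turn(usage: dict, threshold: float = 0.8) -> str:
    """Label a turn 'hit' or 'miss'. The threshold is an arbitrary assumption."""
    return "hit" if cache_hit_ratio(usage) >= threshold else "miss"


# Hypothetical usage blocks, shaped like the API's `usage` object.
healthy = {
    "input_tokens": 1_200,
    "cache_creation_input_tokens": 800,
    "cache_read_input_tokens": 198_000,
}
broken = {
    "input_tokens": 1_200,
    "cache_creation_input_tokens": 198_800,
    "cache_read_input_tokens": 0,
}

for name, usage in [("healthy", healthy), ("broken", broken)]:
    print(name, classify_turn(usage), f"{cache_hit_ratio(usage):.2f}")
```

A long-running agent whose turns keep coming back as "miss" is rewriting its whole context on every call, which matches the billing pattern described above.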
After weeks of auditing token usage and testing alternatives, I moved everything to GLM-5.2 from Z.AI. Flat-rate coding plan. Zero per-token cost. All 11 cron jobs, both servers, every sub-agent.
The part I didn’t expect: GLM is genuinely good at agentic work. Complex tool chains, long context, error recovery. I kept Anthropic and Gemini as fallbacks. Haven’t needed them.
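Keeping other providers as fallbacks can be as simple as trying them in order. This is a generic sketch, not how OpenClaw does it. The function name and the stub providers are hypothetical, and real provider calls would replace the stubs.

```python
from typing import Callable, Sequence

# A provider is a (name, call) pair; `call` takes a prompt and returns text.
Provider = tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try each provider in order and return (name, reply) from the first that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure moves to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real API clients (hypothetical).
def flaky(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def echo(prompt: str) -> str:
    return "ok: " + prompt


result = complete_with_fallback("ping", [("primary", flaky), ("fallback", echo)])
print(result)
```

The order of the list is the whole policy: the flat-rate provider goes first, and the per-token ones are only billed when it fails.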
OpenClaw is a self-hosted AI assistant that runs on your server, reads your code, and handles tasks while you sleep. I teach two classes on setting up and getting the most from OpenClaw on Udemy: Easy OpenClaw and Get Real Work Done With an AI Assistant.