AI Updates: MCP Write Actions, Natural Language Tools, and Edge Models
The AI engineering landscape just shifted significantly, with three major developments worth your attention.

OpenAI Enables Full MCP Write Actions for Enterprise
ChatGPT Enterprise users can now do more than just read data—they can take action. OpenAI rolled out complete Model Context Protocol support with write and modify capabilities, turning ChatGPT into an agent that can update CRMs, create tasks, and orchestrate workflows across your company's tools.
The implementation is thoughtfully gated: admins control connector access via RBAC, and users see explicit confirmation modals before any write operation executes. Developer mode lets you build and test custom connectors, though there's a catch—you can't update connectors after publishing, and mobile support isn't available yet.
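The gating pattern described above can be sketched in a few lines. This is a minimal illustration of the flow, not the actual OpenAI connector API; the names (`WriteAction`, `execute_with_confirmation`) and return strings are assumptions for the example.

```python
# Illustrative sketch: RBAC-gated, confirmation-gated write actions.
# All names here are hypothetical, not OpenAI's real connector API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class WriteAction:
    connector: str   # e.g. "crm"
    operation: str   # e.g. "update_contact"
    payload: dict

def execute_with_confirmation(action: WriteAction,
                              allowed_connectors: set,
                              confirm: Callable[[WriteAction], bool]) -> str:
    # RBAC-style gate: admins whitelist which connectors may write.
    if action.connector not in allowed_connectors:
        return "denied: connector not enabled by admin"
    # Explicit confirmation before any write executes.
    if not confirm(action):
        return "cancelled: user declined"
    return f"executed {action.operation} on {action.connector}"

# Usage: the lambda stands in for the user's confirmation modal.
result = execute_with_confirmation(
    WriteAction("crm", "update_contact", {"id": 42, "stage": "won"}),
    allowed_connectors={"crm"},
    confirm=lambda a: True,
)
print(result)  # executed update_contact on crm
```

The key design choice: the write path is impossible to reach without passing both the admin gate and the per-action confirmation, mirroring the two layers described above.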
Natural Language Destroys JSON for Tool Calling
Forget JSON schemas. New research demonstrates that plain English beats structured formats for tool calling by a significant margin—18 percentage points higher accuracy, 70% less variance, and 31% fewer tokens.
The Natural Language Tools framework works immediately with any LLM without API changes or fine-tuning. DeepSeek-V3 jumped from 78% to 95% accuracy just by switching approaches.
Why does this work? Structured formats create task interference—models struggle to simultaneously understand queries, select tools, maintain syntax, and generate responses. Natural language decouples these concerns with a simple three-stage pattern: reason through each tool's relevance and mark it YES/NO, execute the selected tools, then respond.
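The three stages can be sketched as follows. This is a simplified illustration of the pattern, not the paper's actual framework code: stage 1 would normally be an LLM call returning free-form reasoning, and the tool registry and parsing here are assumptions for the example.

```python
# Illustrative three-stage natural-language tool pattern:
# (1) reason about each tool and mark it YES/NO,
# (2) execute only the tools marked YES,
# (3) compose the response from the results.
import re

TOOLS = {
    "get_weather": lambda: "72F and sunny",
    "get_stock_price": lambda: "$182.31",
}

def select_tools(llm_reasoning: str) -> list:
    # Stage 1 output contains one "tool_name: YES/NO" line per tool.
    selected = []
    for name in TOOLS:
        m = re.search(rf"{name}:\s*(YES|NO)", llm_reasoning)
        if m and m.group(1) == "YES":
            selected.append(name)
    return selected

def run(llm_reasoning: str) -> str:
    results = {n: TOOLS[n]() for n in select_tools(llm_reasoning)}  # stage 2
    return "; ".join(f"{k} -> {v}" for k, v in results.items())     # stage 3

# A model asked "what's the weather?" might produce this reasoning:
reasoning = """The user asks about weather, not stocks.
get_weather: YES
get_stock_price: NO"""
print(run(reasoning))  # get_weather -> 72F and sunny
```

Notice there is no JSON schema for the model to satisfy: selection is plain prose plus a YES/NO marker, which is why the approach transfers to any LLM without API changes.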
Meta's 1B Model Runs 128k Context Locally
Meta released MobileLLM-Pro, a 1.08B parameter model that handles 128k token contexts entirely on-device. The model uses 3:1 local-to-global attention to slash prefill latency and reduce KV cache from 117MB to 40MB.
The quantization story is compelling: int4 compression shows less than 1.3% quality degradation, making this viable for mobile and edge deployment. It outperforms both Gemma 3 1B and Llama 3.2 1B on reasoning and long-context tasks while training on less than 2T fully open-source tokens.
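To see why interleaving local and global attention shrinks the KV cache, a back-of-envelope calculation helps. The layer count, head configuration, window size, and precision below are illustrative assumptions, not MobileLLM-Pro's actual configuration, so the absolute numbers won't match the 117MB/40MB figures—only the mechanism is the point: local layers cache at most a window of tokens regardless of context length.

```python
# Back-of-envelope KV cache size under mixed local/global attention.
# All hyperparameters here are assumptions for illustration only.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   local_ratio=0.75, window=2048, bytes_per_val=2):
    # Each layer stores K and V per token: 2 * heads * head_dim values.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_val
    n_local = int(n_layers * local_ratio)   # sliding-window layers
    n_global = n_layers - n_local           # full-context layers
    local = n_local * per_token * min(seq_len, window)
    global_ = n_global * per_token * seq_len
    return local + global_

full = kv_cache_bytes(30, 4, 64, 128_000, local_ratio=0.0)  # all-global baseline
mixed = kv_cache_bytes(30, 4, 64, 128_000)                  # 3:1 local-to-global
print(f"{full/1e6:.0f} MB -> {mixed/1e6:.0f} MB")
```

With a 3:1 ratio, three quarters of the layers stop scaling with context length entirely, which is where the bulk of the savings at 128k tokens comes from.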
For builders shipping offline-capable applications, this proves you don't need billions of parameters for production-grade performance.
The Seahorse Emoji That Reveals How Hallucination Works
Here's a fascinating experiment: ask ChatGPT if a seahorse emoji exists. It'll confidently say yes—because it genuinely believes one exists based on widespread false memories in training data.
The model builds an internal representation of "seahorse + emoji" in its neural network, fully convinced it's about to output this character. But when it reaches the output layer and searches its vocabulary, nothing matches. So it outputs semantically similar alternatives: 🐠, 🦐, sometimes even 🦄 or 🐉.
Different models handle the realization differently. Some loop through hundreds of attempts. Some eventually correct themselves mid-response. GPT-5 has been observed spiraling through random animals before essentially giving up with frustrated messages.
This exposes two critical issues: sycophantic behavior (prioritizing user agreement over accuracy) and the complete decoupling of confidence from correctness. Models can be 100% certain about things that are 100% false.
What This Means for Builders
If you're shipping AI systems in production:
Test MCP write actions if you're on Enterprise, but audit connectors thoroughly before enabling write permissions.
Experiment with natural language tool calling if JSON approaches show high variance or underperformance.
Evaluate local models for on-device use cases where 128k context and quantization tolerance matter.
Design for hallucination as a certainty. Layer verification, explicit confirmation loops, and graceful failure handling into any workflow where fabricated answers could cause damage.
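The last point—layering verification into the workflow—can be made concrete with the seahorse example itself. This is a minimal sketch of the idea, assuming the "model answer" arrives as a candidate character; the function names and fallback message are hypothetical.

```python
# Minimal sketch of "design for hallucination": verify a model's claim
# against ground truth before trusting it, and fail gracefully.
import unicodedata

def char_exists(char: str) -> bool:
    # Ground-truth check: does this codepoint have an official Unicode name?
    try:
        unicodedata.name(char)
        return True
    except (ValueError, TypeError):
        return False

def answer_with_verification(model_answer: str) -> str:
    if char_exists(model_answer):
        return model_answer
    # Graceful failure instead of a confident fabrication.
    return "unverified: no such character exists"

print(answer_with_verification("🐠"))       # a real emoji passes
print(answer_with_verification("\uE000"))   # private-use codepoint: rejected
```

The same shape generalizes: wherever a model can assert a specific fact (an ID, a file path, an API name), check it against an authoritative source before the assertion reaches a user or triggers a write.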
The tooling is here. The question is whether you're building for capability, reliability, and epistemic honesty simultaneously.
Get the full weekly AI engineering insights →
Want deeper analysis on AI engineering and product strategy? Subscribe to the Lighthouse Newsletter for comprehensive breakdowns every week.