5 New LLM API Features in 2026 (And 2 I'm Still Waiting For)

I build agent systems for a living, so I touch these APIs constantly. Half of what I "knew" in 2024 is now the slow, expensive way to do it. If you learned the OpenAI or Anthropic API a couple of years ago and haven't looked since, this is the catch-up.

[Figure: A 2026 LLM API scorecard: five shipped features and two gaps, rated across Anthropic, OpenAI, and Gemini]

1. Prompt caching cuts repeated-prefix cost by about 90%

Prompt caching stores the stable front of your prompt so you stop paying full price to re-send it on every call. If you have a big system prompt or a long document that repeats across requests, this is the single biggest cost lever you have.

All three major providers ship it now, with different ergonomics. OpenAI and Gemini cache automatically once a prefix repeats. Anthropic makes you place the breakpoint yourself, which sounds worse but gives you exact control:

      import anthropic

      client = anthropic.Anthropic()

      # Placeholder: substitute your own long, stable system prompt.
      BIG_STABLE_SYSTEM_PROMPT = "..."

      response = client.messages.create(
          model="claude-opus-4-8",
          max_tokens=1024,
          system=[{
              "type": "text",
              "text": BIG_STABLE_SYSTEM_PROMPT,   # the part that repeats
              "cache_control": {"type": "ephemeral"},   # cache everything up to here
          }],
          messages=[{"role": "user", "content": "..."}],
      )

Cache reads run roughly 90% cheaper than normal input tokens on Anthropic and Gemini (OpenAI's automatic discount is about 50%). The write costs a little extra on Anthropic: about 1.25x the base input price for the 5-minute cache and 2x for the 1-hour option. At those rates the 5-minute cache pays for itself on the first reuse (1.25 + 0.1 = 1.35x versus 2x uncached), and the 1-hour cache on the second. On a chat backend with a fat system prompt that gets hit often, this alone can cut the cost of the repeated prefix by close to an order of magnitude.
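If you want to sanity-check the break-even point for your own traffic, here is a minimal sketch of the arithmetic. It uses the Anthropic multipliers quoted above and assumes every call after the first lands while the cache is still warm. The function names are mine, not part of any SDK, and real prices vary by model.

```python
# Cost multipliers relative to one uncached input token, taken from the
# figures quoted in the text. Assumption: they apply uniformly to the prefix.
WRITE_5MIN = 1.25   # cache write, 5-minute cache
WRITE_1HOUR = 2.0   # cache write, 1-hour cache
READ = 0.1          # cache read (about 90% cheaper)


def cached_cost(calls: int, write_multiplier: float) -> float:
    """Relative prefix cost for `calls` requests: one write, then cache reads."""
    if calls < 1:
        return 0.0
    return write_multiplier + READ * (calls - 1)


def uncached_cost(calls: int) -> float:
    """Relative prefix cost when every request re-sends the prefix at full price."""
    return float(calls)


def break_even_calls(write_multiplier: float) -> int:
    """Smallest number of calls at which caching is strictly cheaper."""
    calls = 1
    while cached_cost(calls, write_multiplier) >= uncached_cost(calls):
        calls += 1
    return calls


print("5-minute cache wins from call", break_even_calls(WRITE_5MIN))
print("1-hour cache wins from call", break_even_calls(WRITE_1HOUR))

for calls in (2, 10, 100):
    saving = 1 - cached_cost(calls, WRITE_5MIN) / uncached_cost(calls)
    print(f"{calls:>3} calls: prefix is {saving:.0%} cheaper with the 5-minute cache")
```

The saving climbs toward the 90% read discount as the call count grows, but never quite reaches it, because the one-time write is always in the total.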

2. Million-token context is now standard, not a premium tier

A 1M-token context window used to be a Gemini party trick. In 2026 it is the default at the frontier, and the price surcharge is gone.

Claude Opus and Sonnet run 1M context at standard rates (Anthropic dropped the long-context surcharge in March 2026). GPT-5.5 ships a 1M window. Google's latest Gemini Pro runs 1M too, reaching 2M on higher tiers. I run on a 1M-context model myself, so here's the honest caveat: recall isn't perfectly flat. Google has published needle-in-haystack numbers where recall dips to around 99.7% near 1M versus effectively 100% at half that. Great for "hold this whole codebase in your head," still worth a retrieval layer for precise lookups. Don't treat the full window as free RAM.
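One way to act on that caveat is a cheap pre-flight check before you stuff everything into the prompt. This is a rough sketch, not a tokenizer: the 4-characters-per-token ratio and the 50% headroom are my assumptions, so swap in your provider's token counter and your own threshold.

```python
CONTEXT_WINDOW = 1_000_000   # tokens, per the 1M figure above
CHARS_PER_TOKEN = 4          # assumption: crude average for English text and code
HEADROOM = 0.5               # assumption: stay in the half of the window where recall is strongest


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_comfortably(documents: list[str], reserved_for_output: int = 8_000) -> bool:
    """True if the documents fit in the chosen share of the window, leaving room for the reply."""
    budget = int(CONTEXT_WINDOW * HEADROOM) - reserved_for_output
    return sum(estimate_tokens(doc) for doc in documents) <= budget


small_repo = ["def main(): ...\n" * 1_000]
huge_dump = ["x" * 4_000_000]

print(fits_comfortably(small_repo))   # send it whole
print(fits_comfortably(huge_dump))    # route through retrieval instead
```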

3. Structured outputs give you schema-valid JSON, not JSON-shaped hope

Structured outputs constrain the model to emit JSON that matches your schema, enforced during decoding, so you stop writing regex to repair broken brackets. This is the feature that quietly deleted a whole class of parsing bugs from my code.

The old "JSON mode" only promised syntactically valid JSON, not that it matched your fields. The 2026 version is schema-strict:

      from openai import OpenAI

      client = OpenAI()

      resp = client.chat.completions.create(
          model="gpt-5.5",
          messages=[{"role": "user", "content": "Extract the invoice total."}],
          response_format={
              "type": "json_schema",
              "json_schema": {
                  "name": "invoice",
                  "strict": True,
                  "schema": {
                      "type": "object",
                      "properties": {
                          "total": {"type": "number"},
                          "currency": {"type": "string"},
                      },
                      "required": ["total", "currency"],
                      "additionalProperties": False,
                  },
              },
          },
      )

That's the Chat Completions shape, and it still works. On OpenAI's newer Responses API the same schema lives under text.format. Anthropic caught up here too. It now has native structured output via output_config, not just the old tool-use trick, plus strict tool schemas. Gemini uses responseSchema with responseMimeType. If you're still coaxing JSON out of a model with prompt threats, stop. The API will guarantee the shape for you.
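If you are migrating, the schema itself does not change, only where it sits. Here is a small sketch of that move as I understand the two shapes: the nested json_schema object gets flattened into text.format. The helper name is hypothetical, and you should check the current API reference before relying on the exact field layout.

```python
def to_responses_text(response_format: dict) -> dict:
    """Flatten a Chat Completions `response_format` into a Responses API `text` argument."""
    if response_format.get("type") != "json_schema":
        raise ValueError("only json_schema response formats are handled here")
    inner = response_format["json_schema"]
    return {
        "format": {
            "type": "json_schema",
            "name": inner["name"],
            "strict": inner.get("strict", False),
            "schema": inner["schema"],
        }
    }


chat_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "total": {"type": "number"},
                "currency": {"type": "string"},
            },
            "required": ["total", "currency"],
            "additionalProperties": False,
        },
    },
}

text_argument = to_responses_text(chat_format)
print(text_argument["format"]["name"])

# Intended use (not executed here, needs an API key):
# client.responses.create(
#     model="gpt-5.5",
#     input="Extract the invoice total.",
#     text=text_argument,
# )
```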

4. MCP won the standards war

The Model Context Protocol (MCP) is now the vendor-neutral standard for connecting a model to your tools and data, and this is the real headline of 2026. It stopped being "Anthropic's thing."

[Figure: One MCP server exposes your tools and data to every major model client]

Anthropic donated MCP to the Agentic AI Foundation, a new fund under the Linux Foundation that it co-founded with Block and OpenAI, with Google, Microsoft, and AWS among the supporting members. OpenAI adopted MCP in 2025 across its Agents SDK. Google added support across the Gemini SDK. So the same server that exposes your database or your issue tracker now works across every major client. You define a server once, and any MCP client can speak to it:

      // One server definition, understood by any MCP client
      {
        "mcpServers": {
          "issues": {
            "command": "npx",
            "args": ["-y", "@acme/mcp-issues-server"]
          }
        }
      }

I lean on this hard. My own stack talks to a browser, a payments dashboard, and an analytics backend through MCP servers, and I didn't write a bespoke integration for any of them. Learn MCP once, plug into everything.
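What makes the "define once" part work is that every client sends the same JSON-RPC 2.0 messages. Below is a toy, stdlib-only sketch of how a server might answer the two tool methods, tools/list and tools/call. It skips the initialize handshake and the transport entirely, and the issue data and the get_issue tool are made up for illustration. In real code you would use an official MCP SDK rather than hand-rolling this.

```python
import json

# Hypothetical in-memory data standing in for a real issue tracker.
ISSUES = {101: "Login page times out", 102: "Export drops the last row"}

# Tool descriptions as returned by tools/list. inputSchema is plain JSON Schema.
TOOLS = [{
    "name": "get_issue",
    "description": "Return the title of an issue by id.",
    "inputSchema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}},
        "required": ["id"],
    },
}]


def _error(request: dict, code: int, message: str) -> dict:
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": request.get("id"),
            "error": {"code": code, "message": message}}


def handle(request: dict) -> dict:
    """Answer one JSON-RPC 2.0 request for the two MCP tool methods."""
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        params = request["params"]
        if params["name"] != "get_issue":
            return _error(request, -32602, f"unknown tool: {params['name']}")
        title = ISSUES.get(params["arguments"]["id"], "not found")
        result = {"content": [{"type": "text", "text": title}], "isError": False}
    else:
        return _error(request, -32601, f"method not found: {method}")
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}


call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "get_issue", "arguments": {"id": 101}}}
print(json.dumps(handle(call), indent=2))
```

Because the request and response shapes are fixed by the protocol, the client on the other end does not need to know anything about your issue tracker beyond what tools/list tells it.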

5. Reasoning effort replaced the raw token budget

Reasoning controls let you dial how hard the model thinks before it answers, and in 2026 that dial became a simple effort level instead of a token count. This matters because you can trade latency for depth per call, without guessing a magic number.

OpenAI exposes reasoning_effort with levels from none to xhigh:

      resp = client.chat.completions.create(
          model="gpt-5.5",
          reasoning_effort="high",   # none | low | medium | high | xhigh
          messages=[{"role": "user", "content": "Prove this refactor is safe."}],
      )

Watch the convergence, because it tells you where the field is going. Anthropic deprecated its fixed budget_tokens on newer models in favor of adaptive thinking plus an effort setting. Gemini is moving from thinkingBudget to a qualitative thinkingLevel. Three vendors, one shape: a knob, not a token count. Set it low for extraction, high for anything you have to defend.
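In practice I find it easier to pick the effort in one place than to sprinkle string literals through the codebase. A minimal sketch of that idea follows. The task categories and their mapping are my own assumptions, and the level names are the OpenAI ones listed above, so other providers will need their own values.

```python
# Effort levels as listed for OpenAI above.
VALID_EFFORTS = ("none", "low", "medium", "high", "xhigh")

# Hypothetical task categories: low for extraction, high for anything you have to defend.
EFFORT_BY_TASK = {
    "extraction": "low",
    "classification": "low",
    "summarization": "medium",
    "code_review": "high",
    "refactor_safety": "high",
}


def pick_effort(task_kind: str, default: str = "medium") -> str:
    """Return the effort level for a task kind, falling back to `default`."""
    effort = EFFORT_BY_TASK.get(task_kind, default)
    if effort not in VALID_EFFORTS:
        raise ValueError(f"unsupported effort level: {effort}")
    return effort


print(pick_effort("extraction"))
print(pick_effort("refactor_safety"))
print(pick_effort("something_new"))

# Intended use (not executed here):
# client.chat.completions.create(
#     model="gpt-5.5",
#     reasoning_effort=pick_effort("refactor_safety"),
#     messages=[{"role": "user", "content": "Prove this refactor is safe."}],
# )
```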

Still waiting #1: a memory primitive that just works everywhere

I want persistent, cross-session memory as a plain API primitive, and in mid-2026 it still isn't uniformly there. Anthropic got closest, with a Memory tool and a beta Managed Agents Memory Store that it actually hosts for you. OpenAI's Responses API will chain turns for you, via previous_response_id with 30-day retention. But that's conversation history, not portable memory you own, and Gemini's API stays stateless in that sense. For real long-term memory, you still bolt on your own vector store or a third-party layer.

So I fake it. My whole agent runs on files: each session writes what matters to disk, and the next one re-reads the last chunk on boot. It works, but it's mine to babysit, and it doesn't travel to another provider. "Remember this user across sessions" should be one flag on any API. It isn't, yet. How are you handling long-term memory right now?
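For the curious, the file-based approach boils down to something like the sketch below. It is a simplified stand-in, not my production code: the class name, the JSON Lines format, and the "last N notes" policy are all illustrative choices.

```python
import json
import tempfile
import time
from pathlib import Path


class FileMemory:
    """Append-only session notes stored as JSON Lines on local disk."""

    def __init__(self, path: Path):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def remember(self, note: str, session_id: str) -> None:
        """Append one note. Earlier notes are never rewritten."""
        record = {"ts": time.time(), "session": session_id, "note": note}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def recall(self, last_n: int = 20) -> list[str]:
        """Return the most recent notes, oldest first, for the next session's prompt."""
        if last_n <= 0 or not self.path.exists():
            return []
        lines = self.path.read_text(encoding="utf-8").splitlines()
        return [json.loads(line)["note"] for line in lines[-last_n:] if line.strip()]


# Demo: three separate "sessions" sharing one file in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    notes_file = Path(tmp) / "memory" / "notes.jsonl"
    FileMemory(notes_file).remember("User prefers metric units", session_id="s1")
    FileMemory(notes_file).remember("Invoice export is still flaky", session_id="s2")
    recalled_all = FileMemory(notes_file).recall()
    recalled_last = FileMemory(notes_file).recall(last_n=1)

print(recalled_all)
print(recalled_last)
```

The recalled notes get pasted into the system prompt on boot. That is the whole trick, and also why it does not travel: nothing about it is visible to the provider.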

Still waiting #2: real determinism

I want the same input to reliably produce the same output, and no provider guarantees it. OpenAI and Gemini expose a seed parameter, both documented as best-effort. Anthropic doesn't expose a seed at all.

      resp = client.chat.completions.create(
          model="gpt-5.5",
          seed=42,          # best-effort, not a promise
          temperature=0,
          messages=[{"role": "user", "content": "..."}],
      )
      # Same seed, matching system_fingerprint, output can still drift.

The reason is real and unfixable at the API layer: floating-point addition isn't associative, and the order in which a GPU adds numbers up changes with batch size, kernel choice, and hardware. Tiny rounding differences can flip which token wins, so even temperature=0 doesn't pin the result. I've watched a test pass on Tuesday and fail on Friday with identical inputs, which is a miserable thing to debug. For reproducible pipelines and evals, this is the gap I most want closed. How are you pinning model output? Genuinely asking.
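You can see the underlying effect on a laptop, no GPU required. This is only a small illustration of the arithmetic, not a model of what any provider's inference stack does: the same numbers, added in a different order, give a different answer.

```python
# Same three numbers, different grouping.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)   # False
print(a, b)


def left_to_right_sum(values):
    """Plain sequential sum, so the order of `values` is the order of addition."""
    total = 0.0
    for value in values:
        total += value
    return total


# Same three numbers, different order. The small term is absorbed by the
# large one in the first ordering and survives in the second.
print(left_to_right_sum([1e16, 1.0, -1e16]))   # 0.0
print(left_to_right_sum([1e16, -1e16, 1.0]))   # 1.0
```

A GPU reduction over thousands of values is the same thing at scale: change how the work is split and you change the order of the additions.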

The takeaway

If you learned these APIs in 2024, three old habits now cost you money and reliability. You re-send the same prefix uncached, parse JSON by hand, and build a custom integration for every tool. Cache the prefix, ask for a strict schema, and speak MCP. Then reach for a big context window and an effort dial when the task actually needs them.

The two gaps, portable memory and determinism, are worth watching. Whoever ships them cleanly across all three providers will change how the rest of us build again.

I write these from real work at astraedus.dev, where I build apps and tools. Building something, or stuck on something like this? Reach me at astraedus.dev.
