# AI Inference & TEE

Inference runs inside Intel TDX (Trust Domain Extensions) trusted execution environments (TEEs), provided by Chutes.ai. A TEE is a hardware-isolated execution environment that the host operating system, hypervisor, and platform operator cannot access. Code and data inside a TEE are encrypted in memory by the CPU, so the operator cannot read memory, inspect variables, or intercept outputs. The hardware enforces this isolation cryptographically.

ZDrive uses a tiered fallback chain to maximize availability. If a model is overloaded or returns HTTP 429 or 503, the worker automatically retries the request with the next model in the chain, with no client-side intervention required.
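
The retry behavior described above can be sketched as a loop over the chain. This is a minimal illustration, not the worker's actual code: `runFallbackChain`, `Send`, and `Outcome` are hypothetical names, the upstream call is injected as a function so the logic runs without network access, and returning the last status once every model has been tried is an assumption.

```typescript
// Result of one upstream attempt (hypothetical shape).
interface UpstreamResult {
  status: number;
}

// Injected function that sends the request to one model (hypothetical).
type Send = (model: string) => Promise<UpstreamResult>;

interface Outcome {
  ok: boolean;
  status: number;
  modelUsed: string | null;
  tried: string[];
}

async function runFallbackChain(chain: string[], send: Send): Promise<Outcome> {
  const tried: string[] = [];
  let lastStatus = 503; // assumption: an empty chain is reported as unavailable

  for (const model of chain) {
    tried.push(model);
    const res = await send(model);
    lastStatus = res.status;

    if (res.status === 200) {
      return { ok: true, status: 200, modelUsed: model, tried };
    }
    // Only 429 and 503 move on to the next model; anything else stops the loop.
    if (res.status !== 429 && res.status !== 503) {
      return { ok: false, status: res.status, modelUsed: null, tried };
    }
  }

  // Assumption: when every model is overloaded, surface the last status seen.
  return { ok: false, status: lastStatus, modelUsed: null, tried };
}
```

Injecting `send` keeps the retry policy separate from the HTTP client, so the same loop can be exercised with stubbed status codes.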

## Model availability

| Tier     | Primary       | Fallback chain                                                              |
| -------- | ------------- | --------------------------------------------------------------------------- |
| **Paid** | User-selected | DeepSeek-V3.1-TEE → DeepSeek-R1-0528-TEE → MiniMax-M2.5-TEE → Qwen3-32B-TEE |
| **Free** | Qwen3-32B-TEE | MiniMax-M2.5-TEE                                                            |
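
As a rough illustration of the table above, the chain for a request could be assembled as follows. The function name `modelChain` and the decision to drop a duplicate when the user-selected model already appears in the fallback list are assumptions, not documented behavior.

```typescript
type Tier = "paid" | "free";

// Model names copied from the table above.
const PAID_FALLBACKS: string[] = [
  "DeepSeek-V3.1-TEE",
  "DeepSeek-R1-0528-TEE",
  "MiniMax-M2.5-TEE",
  "Qwen3-32B-TEE",
];
const FREE_PRIMARY = "Qwen3-32B-TEE";
const FREE_FALLBACKS: string[] = ["MiniMax-M2.5-TEE"];

// Returns the ordered list of models to try: primary first, then fallbacks.
function modelChain(tier: Tier, requested: string): string[] {
  // Free tier ignores the requested model and always starts from the fixed primary.
  const primary = tier === "paid" ? requested : FREE_PRIMARY;
  const fallbacks = tier === "paid" ? PAID_FALLBACKS : FREE_FALLBACKS;
  // Assumption: do not try the primary model a second time.
  return [primary, ...fallbacks.filter((m) => m !== primary)];
}
```

For example, `modelChain("free", "DeepSeek-V3.1-TEE")` yields `["Qwen3-32B-TEE", "MiniMax-M2.5-TEE"]`, since the free tier overrides the requested model.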

The model that served the request is returned in the `X-Zdrive-Model-Used` response header on every inference call.
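
A client can compare this header against the model it asked for to tell whether a fallback occurred. The sketch below is illustrative only: `detectFallback` and `HeaderReader` are hypothetical names, and it assumes the header value uses the same identifier format as the requested model.

```typescript
// Minimal structural type; the `headers` object of a fetch Response satisfies it.
interface HeaderReader {
  get(name: string): string | null;
}

function detectFallback(
  requested: string,
  headers: HeaderReader
): { served: string | null; fellBack: boolean } {
  const served = headers.get("X-Zdrive-Model-Used");
  // Assumption: a missing header is treated as "unknown", not as a fallback.
  return { served, fellBack: served !== null && served !== requested };
}
```

In a browser, this would be called as `detectFallback(model, response.headers)` as soon as the response headers arrive, before the SSE body is consumed.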

## Inference flow

```mermaid
sequenceDiagram
    participant B as Browser
    participant W as CF Worker
    participant KV as KV Store
    participant BC as Base Chain
    participant T as Chutes TEE

    B->>W: POST /v1/chat/completions
    Note over B,W: {x_wallet, x_wallet_sig, x_session, messages, model}
    W->>W: verifyWalletIdentity() via ERC-1271
    W->>BC: credits(wallet) → balance
    W->>KV: Check rate limit key
    alt balance > 0 (Paid)
        W->>W: Use requested model
        W->>KV: Increment hourly burst counter
    else balance == 0 (Free / Connected)
        W->>W: Override model → Qwen3-32B-TEE
        W->>W: Check datacenter ASN block
        W->>KV: Increment daily counter
    end
    loop Fallback chain
        W->>T: POST /chat/completions {model, messages, stream: true}
        alt 200 OK
            T-->>W: SSE stream
            W-->>B: SSE stream (X-Zdrive-Model-Used header)
            W->>BC: consumeCredit(wallet) async
        else 429 / 503
            W->>W: Try next model in chain
        else Other error
            W-->>B: Error response (no retry)
        end
    end
```
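
The tier branch in the diagram above can be summarized as a small pure function. This is a sketch under assumptions: `planRequest` and `RequestPlan` are hypothetical names, the on-chain balance is represented as a plain `number` for brevity, and the datacenter ASN check is modeled as applying only to the free branch, where the diagram places it.

```typescript
type Tier = "paid" | "free";

interface RequestPlan {
  tier: Tier;
  model: string;
  rateLimitWindow: "hourly" | "daily";
  checkDatacenterAsn: boolean;
}

const FREE_MODEL = "Qwen3-32B-TEE";

function planRequest(creditBalance: number, requestedModel: string): RequestPlan {
  if (creditBalance > 0) {
    // Paid: honor the requested model and count against the hourly burst limit.
    return {
      tier: "paid",
      model: requestedModel,
      rateLimitWindow: "hourly",
      checkDatacenterAsn: false,
    };
  }
  // Free: override the model, check the ASN block, and count against the daily limit.
  return {
    tier: "free",
    model: FREE_MODEL,
    rateLimitWindow: "daily",
    checkDatacenterAsn: true,
  };
}
```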

## Attestation

Every TEE model has a `chute_id` that maps to a running enclave on Chutes.ai infrastructure. The attestation endpoint verifies:

1. **TDX quote** — cryptographic proof the enclave is genuine Intel TDX hardware
2. **GPU evidence** — proof inference ran on a confidential GPU (NVIDIA Blackwell/Hopper)
3. **Model hash** — SHA-256 of the TDX quote, serving as a fingerprint of the running binary

```mermaid
sequenceDiagram
    participant B as Browser
    participant W as CF Worker
    participant C as Chutes API

    B->>W: GET /v1/attestation/report?model=deepseek-ai/DeepSeek-V3.1-TEE
    W->>W: Look up chute_id for model
    W->>C: GET /chutes/{chuteId}/evidence?nonce={32-byte random hex}
    C-->>W: {evidence: [{quote, gpu_evidence, instance_id}]}
    W->>W: Check quote present → tdx_verified = true
    W->>W: Check gpu_evidence present → gpu_verified = true
    W->>W: Extract GPU arch (e.g. BLACKWELL)
    W->>W: SHA-256(TDX quote) → model_hash
    W-->>B: {gpu_tee_verified, fpif_verified, model_hash, gpu_arch, instance_count}
```

> **Note:** Attestation proves the TEE ran the claimed code in isolated hardware. It does not prove the model's output is correct or unbiased — only that the execution environment was tamper-resistant.
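
The worker-side steps in the attestation diagram can be approximated as below. This is a sketch, not the actual implementation: `makeNonce`, `summarizeEvidence`, and the two interfaces are hypothetical names; it assumes a Node-compatible runtime where `node:crypto` is available, that the quote arrives as a string whose UTF-8 bytes are hashed, and that the first instance carrying a quote supplies the hash. GPU architecture extraction is omitted because the evidence format is not described here.

```typescript
import { createHash, randomBytes } from "node:crypto";

// One entry of the `evidence` array (field names from the diagram; types assumed).
interface EvidenceItem {
  quote?: string;
  gpu_evidence?: unknown;
  instance_id?: string;
}

interface AttestationSummary {
  tdx_verified: boolean;
  gpu_verified: boolean;
  model_hash: string | null;
  instance_count: number;
}

// 32 random bytes, hex-encoded (64 characters), for the `nonce` query parameter.
function makeNonce(): string {
  return randomBytes(32).toString("hex");
}

function summarizeEvidence(evidence: EvidenceItem[]): AttestationSummary {
  const quotes = evidence
    .map((e) => e.quote)
    .filter((q): q is string => typeof q === "string" && q.length > 0);
  const firstQuote = quotes[0];

  return {
    // Presence checks only, mirroring the diagram.
    tdx_verified: quotes.length > 0,
    gpu_verified: evidence.some((e) => e.gpu_evidence !== undefined && e.gpu_evidence !== null),
    // SHA-256 of the TDX quote, hex-encoded.
    model_hash: firstQuote
      ? createHash("sha256").update(firstQuote, "utf8").digest("hex")
      : null,
    instance_count: evidence.length,
  };
}
```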

## Input limits

| Tier | Max input chars | Equivalent tokens (approx) |
| ---- | --------------- | -------------------------- |
| Free | 16,000          | \~4,000                    |
| Paid | 64,000          | \~16,000                   |
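
A client can pre-check a request against these limits before sending it. The sketch below is an assumption-laden illustration: `checkInputLimit` is a hypothetical name, input size is taken as the summed length of each message's `content` string, and the token estimate uses the 4 characters per token ratio implied by the table.

```typescript
// Limits copied from the table above.
const MAX_INPUT_CHARS = { free: 16000, paid: 64000 } as const;
type Tier = keyof typeof MAX_INPUT_CHARS;

interface ChatMessage {
  role: string;
  content: string;
}

function checkInputLimit(
  tier: Tier,
  messages: ChatMessage[]
): { chars: number; approxTokens: number; withinLimit: boolean } {
  // Assumption: the limit applies to the total characters across all messages.
  const chars = messages.reduce((total, m) => total + m.content.length, 0);
  return {
    chars,
    approxTokens: Math.ceil(chars / 4),
    withinLimit: chars <= MAX_INPUT_CHARS[tier],
  };
}
```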


