This is a hands-on companion to our earlier post on The Privacy-First AI Stack. If you’re looking for the conceptual overview, start there. If you want code, jump in.

Every day, thousands of companies send their most sensitive data to public AI models. Patient records. Legal briefs. Financial projections. Source code. Customer PII.

All of it flows through OpenAI, Anthropic, Google, and other public AI providers, readable by the provider, uncontrolled, and often untracked. For startups building consumer apps, this might be acceptable. For healthcare providers, legal firms, financial institutions, and government contractors? It’s a compliance nightmare waiting to happen.

But here’s the good news: you don’t have to choose between AI capabilities and data privacy.

In this deep dive, I’ll walk you through setting up end-to-end encrypted AI inference using NOMYO AI, a self-funded, privacy-first AI platform that wraps every prompt and response in AES-256-GCM encryption with RSA-OAEP key exchange, so your data is never visible to the provider in plaintext.


What Is End-to-End Encrypted Inference?

Traditional cloud AI works like this:

Your Data → Plaintext → AI Provider → Plaintext Results → Your App

Your data leaves your infrastructure as plain text, travels over the internet (TLS-encrypted, yes, but the provider sees it), gets processed, and comes back as plain text. The provider sees everything. They may cache it. They may use it for training. You have little control over what happens to it next.

End-to-end encrypted inference changes that equation entirely:

Your Data → Encrypted → AI Processing → Encrypted Results → Your Decrypted Output
          ↑                    ↑                    ↑
      Never seen         Never seen           Never seen
      by provider        by provider          by provider

Your data is encrypted client-side before it ever leaves your server. The AI provider processes ciphertext — they literally cannot see what they’re processing. Results come back encrypted and are decrypted only on your end.

This isn’t theoretical. It’s achievable today with the right architecture. And in this guide, I’ll show you exactly how to set it up with NOMYO.


The NOMYO Architecture: How It Actually Works

NOMYO uses a hybrid encryption scheme that combines the speed of AES-256-GCM for payload encryption with RSA-OAEP (4096-bit) for secure key exchange. Here’s the technical breakdown:

1. Hybrid Encryption

  • AES-256-GCM encrypts your actual prompt and response payloads. This is authenticated encryption — it provides both confidentiality and integrity in a single operation.
  • RSA-OAEP (4096-bit) handles the key exchange. The server’s public key is fetched once, and your client generates a fresh AES-256 key for each request, encrypts it with RSA, and sends it along with the encrypted payload.
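
To make the hybrid scheme described above concrete, here is a minimal sketch built on the open-source cryptography package. It is not NOMYO's actual implementation: the function names, the envelope layout, and the choice of SHA-256 inside OAEP are assumptions made for illustration. It shows a fresh AES-256 key encrypting the payload with GCM, and that key being wrapped with the recipient's RSA public key.

```python
import json
import secrets

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# RSA-OAEP padding; SHA-256 as the hash is an assumption, not taken from the article
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def encrypt_request(server_public_key, payload: dict) -> dict:
    """Encrypt a payload with a fresh AES-256 key, then wrap that key with RSA-OAEP."""
    aes_key = secrets.token_bytes(32)  # fresh 256-bit key for this request only
    nonce = secrets.token_bytes(12)    # 96-bit GCM nonce
    ciphertext = AESGCM(aes_key).encrypt(nonce, json.dumps(payload).encode(), None)
    return {
        "wrapped_key": server_public_key.encrypt(aes_key, OAEP),
        "nonce": nonce,
        "ciphertext": ciphertext,  # GCM appends its 16-byte authentication tag
    }


def decrypt_request(server_private_key, envelope: dict) -> dict:
    """Recipient side: unwrap the AES key, then decrypt and authenticate the payload."""
    aes_key = server_private_key.decrypt(envelope["wrapped_key"], OAEP)
    plaintext = AESGCM(aes_key).decrypt(envelope["nonce"], envelope["ciphertext"], None)
    return json.loads(plaintext)


# Demo key pair. 2048 bits keeps the demo fast; the article states 4096-bit keys.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
envelope = encrypt_request(private_key.public_key(), {"prompt": "hello"})
roundtrip = decrypt_request(private_key, envelope)
```

Because GCM is authenticated, decrypt_request raises an exception if the ciphertext or nonce has been modified in transit, which is the integrity property mentioned in the first bullet.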

2. Forward Secrecy

Every single inference gets a unique AES-256 key. Keys are generated via secrets.token_bytes() and zeroed from memory immediately after use. Even if a key is somehow compromised, only a single inference is affected — all other requests remain secure.
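
The per-request key lifecycle described above can be sketched with the standard library alone. This is an illustration rather than NOMYO's code, and the ephemeral_key helper is a hypothetical name. One caveat: secrets.token_bytes() returns an immutable bytes object, so the sketch copies it into a bytearray that can be overwritten; the original bytes object is left to the garbage collector, which is one reason real implementations rely on lower-level memory handling.

```python
import secrets
from contextlib import contextmanager


@contextmanager
def ephemeral_key(num_bytes: int = 32):
    """Yield a fresh random key in a mutable buffer and zero the buffer on exit."""
    key = bytearray(secrets.token_bytes(num_bytes))
    try:
        yield key
    finally:
        # Overwrite in place; a bytearray is mutable, unlike bytes
        for i in range(len(key)):
            key[i] = 0


with ephemeral_key() as key_a:
    snapshot_a = bytes(key_a)   # copy kept only to demonstrate the wipe
    length_inside = len(key_a)

with ephemeral_key() as key_b:
    snapshot_b = bytes(key_b)   # a second request gets an unrelated key
```

After each with block the buffer holds only zero bytes, and the two snapshots differ, which mirrors the claim that a compromised key exposes a single inference.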

3. Secure Memory Protection

Plaintext payloads are protected from being swapped to disk. All crypto material is zeroed immediately after encryption. No core dumps. No page files. No memory inspection can recover your data.

4. TPM 2.0 Hardware Attestation (Maximum Tier)

When the server has a TPM 2.0 chip, every response includes a cryptographically signed hardware quote proving:

  • Which firmware and Secure Boot state the server is running (PCR 0, 7)
  • Which application binary is running (PCR 10, if IMA is active)
  • The quote is signed by an ephemeral AIK (Attestation Identity Key) generated fresh for each request and tied to a payload_id nonce, so it cannot be replayed

This is the kind of hardware-level attestation you’d expect in defense applications — now available for AI inference.


Step 1: Installation

Setting up NOMYO takes about 5 minutes. Here’s how:

Prerequisites

  • Python 3.7 or higher
  • pip (Python package installer)
  • A paid subscription at chat.nomyo.ai (required for API access)

Install from PyPI

pip install nomyo

Or install into an isolated virtual environment:

# Create virtual environment
python -m venv nomyo_env

# Activate it
source nomyo_env/bin/activate  # Linux/Mac
# or
nomyo_env\Scripts\activate     # Windows

# Install nomyo
pip install nomyo

Verify Installation

import nomyo
print("NOMYO client installed successfully!")

That’s it. The package automatically installs all dependencies including cryptography, httpx, certifi, and the async compatibility layer.


Step 2: Basic Setup — Your First Encrypted Inference

Let’s start with the simplest possible example. NOMYO provides a SecureChatCompletion class that’s a drop-in replacement for OpenAI’s ChatCompletion API. Same interface. Same parameters. Zero rewrites.

import asyncio
from nomyo import SecureChatCompletion

async def main():
    # Initialize client (defaults to https://api.nomyo.ai)
    client = SecureChatCompletion(api_key="your-api-key-here")

    # Simple chat completion — data is encrypted before it leaves your machine
    response = await client.create(
        model="Qwen/Qwen3-0.6B",
        messages=[
            {"role": "user", "content": "Hello! How are you today?"}
        ],
        temperature=0.7
    )

    # Extract the response, then delete it immediately
    reply = response['choices'][0]['message']['content']
    del response  # Minimize decrypted data lifetime in memory

    print(reply)

asyncio.run(main())

That’s it. The encryption and decryption happen transparently. You don’t manage keys manually. You don’t handle crypto. NOMYO does all of that internally.

With System Messages

import asyncio
from nomyo import SecureChatCompletion

async def main():
    client = SecureChatCompletion(api_key="your-api-key-here")

    response = await client.create(
        model="Qwen/Qwen3-0.6B",
        messages=[
            {"role": "system", "content": "You are a helpful legal assistant."},
            {"role": "user", "content": "Summarize the key clauses in this contract..."}
        ],
        temperature=0.7
    )

    print(response['choices'][0]['message']['content'])

asyncio.run(main())

Step 3: Understanding Security Tiers

NOMYO provides three security tiers that let you match the encryption level to your data sensitivity:

Tier     | What It Includes                       | Use Case
standard | Full E2E encryption + secure tokenizer | General secure inference
high     | Full E2E + secure memory guaranteed    | Sensitive business data
maximum  | High + TPM 2.0 hardware attestation    | HIPAA PHI, classified data

Here’s what each tier guarantees:

  • standard — Your prompt and response are encrypted with AES-256-GCM before leaving your machine. A secure tokenizer is enforced, preventing the provider from reconstructing your data from token IDs alone.

  • high — Everything in standard, plus secure memory is guaranteed. Plaintext payloads are protected from being swapped to disk, and all crypto material is zeroed immediately after use.

  • maximum — Everything in high, plus the server must have a TPM 2.0 chip. Every response includes a cryptographically signed hardware attestation proving the server’s firmware, Secure Boot state, and application binary integrity via PCR measurements. If the server lacks TPM 2.0, the request is rejected.

import asyncio
from nomyo import SecureChatCompletion

async def use_security_tiers():
    client = SecureChatCompletion(api_key="your-api-key-here")

    # Standard: General queries
    response1 = await client.create(
        model="Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "What's the weather today?"}],
        security_tier="standard"
    )

    # High: Sensitive business data
    response2 = await client.create(
        model="Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "What's my bank account balance?"}],
        security_tier="high"
    )

    # Maximum: HIPAA PHI or classified data
    response3 = await client.create(
        model="Qwen/Qwen3.5-9B",
        messages=[{"role": "user", "content": "Share my personal medical records"}],
        security_tier="maximum"
    )

asyncio.run(use_security_tiers())
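
In practice you may want to pick the tier from a data classification rather than hardcode it per call. The following is a small sketch of that idea. The tier names come from the table above, while the classification labels and the choose_security_tier helper are hypothetical and should be replaced with your organization's own scheme.

```python
# Tier names come from the article; the labels below are hypothetical examples
TIER_ORDER = ["standard", "high", "maximum"]

LABEL_TO_TIER = {
    "public": "standard",
    "internal": "standard",
    "confidential": "high",
    "financial": "high",
    "phi": "maximum",
    "classified": "maximum",
}


def choose_security_tier(labels, default="high"):
    """Return the strictest tier required by any label; unknown labels get `default`."""
    if not labels:
        return default
    tiers = [LABEL_TO_TIER.get(label.lower(), default) for label in labels]
    return max(tiers, key=TIER_ORDER.index)
```

The result can be passed directly as security_tier=choose_security_tier(labels). Defaulting unknown labels to "high" is a deliberate choice here: unlabeled data is treated as sensitive rather than silently downgraded.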

Step 4: Key Management — Ephemeral Keys by Default

By default, NOMYO uses ephemeral keys — a fresh key pair is generated for every session and zeroed from memory when the session ends. This is the recommended approach for most use cases, as it provides maximum security with zero operational overhead.

import asyncio
from nomyo import SecureChatCompletion

async def ephemeral_keys():
    # Ephemeral keys are the default — no setup needed
    client = SecureChatCompletion(api_key="your-api-key-here")

    # Each session gets a fresh key pair.
    # When the client is garbage collected, the keys are zeroed.
    response = await client.create(
        model="Qwen/Qwen3.5-9B",
        messages=[{"role": "user", "content": "Analyze this document..."}],
        security_tier="high"
    )

    print(response['choices'][0]['message']['content'])

asyncio.run(ephemeral_keys())

Why ephemeral keys are the best practice:

  • No persistent keys on disk — even if an attacker gains filesystem access, there’s nothing to steal.
  • Automatic cleanup — when the session ends, the key material is zeroed from memory.
  • Forward secrecy — each session is independent. Compromising one session reveals nothing about others.
  • Zero operational overhead — no password management, no key rotation schedules, no secret vaults.

Optional: Persistent Keys (For Long-Lived Services)

If you’re running a long-lived service and want to persist keys across restarts, you can opt in:

import asyncio
from nomyo import SecureChatCompletion

async def persistent_keys():
    client = SecureChatCompletion(api_key="your-api-key-here")

    # Generate keys with password protection and save to disk
    await client.generate_keys(
        save_to_file=True,
        key_dir="client_keys",
        password="your-strong-password-here"
    )

    # Load previously saved keys on subsequent runs
    await client.load_keys(
        "client_keys/private_key.pem",
        "client_keys/public_key.pem",
        password="your-strong-password-here"
    )

    # ... use as normal ...

asyncio.run(persistent_keys())

When to use persistent keys:

  • Long-running daemon processes where regenerating keys on each restart is undesirable
  • Multi-process architectures that share a key pair
  • Compliance requirements that mandate key persistence with specific rotation policies

Production best practices for persistent keys:

  • Always use password protection for private keys
  • Keep private key file permissions at 600 (owner-only access)
  • Never share your private key
  • Verify the server’s public key fingerprint before first use
  • Use HTTPS connections only (never HTTP in production)
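
The file-permission advice above can be enforced from code on POSIX systems. This is a minimal sketch using only the standard library; the helper names are hypothetical, and on Windows os.chmod does not provide equivalent protection.

```python
import os
import stat


def harden_private_key(path: str) -> None:
    """Restrict a key file to owner read/write (mode 600) on POSIX systems."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def is_owner_only(path: str) -> bool:
    """Return True if neither group nor others have any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

A service could call is_owner_only("client_keys/private_key.pem") at startup and refuse to load the key if the check fails, rather than trusting that the file was created with the right mode.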

Step 5: Working with Models

NOMYO offers 15+ open-source models, all available with E2E encryption by default. Here’s the current lineup:

Model                                   | Parameters      | Best For
Qwen/Qwen3-0.6B                         | 0.6B            | Low latency, edge use
Qwen/Qwen3.5-0.8B                       | 0.8B            | Lightweight fast inference
LiquidAI/LFM2.5-1.2B-Thinking           | 1.2B            | Reasoning, chain-of-thought
Qwen/Qwen3.5-9B                         | 9B              | Balanced quality and speed
utter-project/EuroLLM-9B-Instruct-2512  | 9B              | Multilingual (European languages)
ServiceNow-AI/Apriel-1.6-15b-Thinker    | 15B             | Math, physics, science reasoning
openai/gpt-oss-20b                      | 20B             | General purpose
Qwen/Qwen3.5-27B                        | 27B             | High quality, large context
Qwen/Qwen3.5-35B-A3B                    | 35B (3B active) | Highest quality general model
google/medgemma-27b-it                  | 27B             | Medical domain (additional TOU required)
moonshotai/Kimi-Linear-48B-A3B-Instruct | 48B (3B active) | Largest capacity, 1M context

Note: MoE (Mixture of Experts) models list total and active parameter counts. Only the active parameters are used for each token, which keeps inference cost closer to that of a much smaller dense model.

Note: The google/medgemma-27b-it medical model requires acceptance of additional Terms of Use before it can be accessed.

Choosing the Right Model

import asyncio
from nomyo import SecureChatCompletion

async def model_selection():
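    client = SecureChatCompletion(api_key="your-api-key-here")

    # Placeholder inputs; replace with your own text
    contract_text = "Summarize the key clauses in this contract..."
    complex_query = "..."
    simple_query = "..."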

    # Contract analysis — high sensitivity
    legal_response = await client.create(
        model="Qwen/Qwen3.5-9B",
        messages=[{"role": "user", "content": contract_text}],
        security_tier="high"
    )

    # Highest quality general model
    general_response = await client.create(
        model="Qwen/Qwen3.5-35B-A3B",
        messages=[{"role": "user", "content": complex_query}],
        security_tier="high"
    )

    # Low-latency edge use
    edge_response = await client.create(
        model="Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": simple_query}],
        security_tier="standard"
    )

asyncio.run(model_selection())
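
If model choice is driven by task type, a lookup table keeps that decision in one place. The sketch below uses model identifiers from the table above; the task names and the pick_model helper are hypothetical, and the mapping reflects the "Best For" column rather than any benchmark.

```python
# Model identifiers are taken from the article's table; task names are hypothetical
MODEL_BY_TASK = {
    "edge": "Qwen/Qwen3-0.6B",
    "reasoning": "LiquidAI/LFM2.5-1.2B-Thinking",
    "multilingual": "utter-project/EuroLLM-9B-Instruct-2512",
    "science": "ServiceNow-AI/Apriel-1.6-15b-Thinker",
    "medical": "google/medgemma-27b-it",
    "long_context": "moonshotai/Kimi-Linear-48B-A3B-Instruct",
}
DEFAULT_MODEL = "Qwen/Qwen3.5-9B"  # described in the table as balanced quality and speed


def pick_model(task: str) -> str:
    """Return the model mapped to `task`, falling back to the balanced default."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)
```

The return value can be passed as model=pick_model("multilingual") in any of the calls shown above.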

Step 6: Error Handling and Production Patterns

Real applications need robust error handling. NOMYO provides specific exception classes:

import asyncio
from nomyo import SecureChatCompletion, AuthenticationError, InvalidRequestError, RateLimitError, ServiceUnavailableError

async def robust_inference():
    client = SecureChatCompletion(
        api_key="your-api-key-here",
        max_retries=2  # Automatic retry on 429, 500, 502, 503, 504
    )

    # Placeholder input; replace with your own text
    sensitive_data = "Analyze this document..."

    try:
        response = await client.create(
            model="Qwen/Qwen3.5-9B",
            messages=[{"role": "user", "content": sensitive_data}],
            security_tier="high"
        )
        print(response['choices'][0]['message']['content'])

    except AuthenticationError as e:
        print(f"Authentication failed: {e}")
        # Check your API key
    except InvalidRequestError as e:
        print(f"Invalid request: {e}")
        # Check model name, message format, parameters
    except RateLimitError as e:
        print(f"Rate limit exceeded: {e}")
        # Implement backoff
    except ServiceUnavailableError as e:
        print(f"Service unavailable: {e}")
        # For maximum tier: server may lack TPM 2.0
    except Exception as e:
        print(f"Unexpected error: {e}")

asyncio.run(robust_inference())

Rate Limits

The NOMYO API enforces rate limits to ensure fair usage:

  • Default: 1 request/second
  • Burst: Up to 2 requests/second (twice per 10-second window)
  • Professional plan: 2 req/s with 4 burst
  • Abuse protection: Repeated burst abuse triggers a 30-minute cool-down

Implement exponential backoff in your application:

import asyncio
from nomyo import RateLimitError

async def request_with_backoff(client, messages, max_retries=5):
    delay = 0.5
    for attempt in range(max_retries):
        try:
            response = await client.create(
                model="Qwen/Qwen3.5-9B",
                messages=messages
            )
            return response
        except RateLimitError:
            await asyncio.sleep(delay)
            delay = min(delay * 2, 30)
    raise RuntimeError("Rate limit exceeded after maximum retries")
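
When several workers share one API key, retries that all use the same doubling schedule tend to collide again. A common refinement is to randomize each wait ("full jitter"). The sketch below is a generic helper, not part of the NOMYO client: retry_with_jitter and FakeRateLimit are hypothetical names, and the sleep function is injectable so the demo runs without real delays.

```python
import asyncio
import random


async def retry_with_jitter(call, retry_on, max_retries=5, base=0.5, cap=30.0,
                            sleep=asyncio.sleep):
    """Await `call()`; on `retry_on`, wait a random time below a doubling bound."""
    for attempt in range(max_retries):
        try:
            return await call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the original error
            bound = min(cap, base * 2 ** attempt)
            await sleep(random.uniform(0, bound))


class FakeRateLimit(Exception):
    """Stand-in for the client's rate-limit exception in this demo."""


async def _demo():
    attempts = {"count": 0}
    waits = []

    async def flaky():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise FakeRateLimit()
        return "ok"

    async def record_sleep(seconds):
        waits.append(seconds)  # record the wait instead of actually sleeping

    result = await retry_with_jitter(flaky, FakeRateLimit, sleep=record_sleep)
    return result, attempts["count"], waits


demo_result, demo_attempts, demo_waits = asyncio.run(_demo())
```

With the real client, call would be a small coroutine function wrapping client.create(...) and retry_on would be RateLimitError.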

Step 7: Advanced Usage — Tools, Sequential Batching, and Memory Safety

Using Tool Calling

NOMYO supports OpenAI-compatible tool calling, fully encrypted:

import asyncio
from nomyo import SecureChatCompletion

async def chat_with_tools():
    client = SecureChatCompletion(api_key="your-api-key-here")

    response = await client.create(
        model="Qwen/Qwen3.5-9B",
        messages=[
            {"role": "user", "content": "Calculate the mean of: 100, 200, 300, 400"}
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "calculate_statistics",
                    "description": "Calculate statistical measures",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "data": {"type": "array", "items": {"type": "number"}}
                        },
                        "required": ["data"]
                    }
                }
            }
        ]
    )

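    # When the model decides to call a tool, "content" may be empty and the
    # request appears under "tool_calls" (OpenAI-compatible response shape)
    message = response['choices'][0]['message']
    print(message.get('tool_calls') or message['content'])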

asyncio.run(chat_with_tools())
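
The request above only declares the tool; your application still has to run it when the model asks for it. The sketch below shows one way to do that, assuming the response follows the OpenAI tool-call shape (a tool_calls list whose function.arguments field is a JSON string), which the article's compatibility claim suggests but does not spell out. The run_tool_calls helper and the sample message are hypothetical.

```python
import json
import statistics


def calculate_statistics(data):
    """Local implementation of the tool declared in the request."""
    return {"mean": statistics.mean(data)}


LOCAL_TOOLS = {"calculate_statistics": calculate_statistics}


def run_tool_calls(message: dict) -> list:
    """Execute each requested tool and build the `tool` role messages to send back."""
    results = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        if name not in LOCAL_TOOLS:
            raise ValueError(f"Model requested unknown tool: {name}")
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        output = LOCAL_TOOLS[name](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results


# Hand-written message in the assumed shape, used here instead of a live response
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
            "name": "calculate_statistics",
            "arguments": '{"data": [100, 200, 300, 400]}',
        },
    }],
}
tool_messages = run_tool_calls(sample_message)
```

The returned messages would be appended to the conversation, after the assistant message, and sent in a follow-up client.create(...) call so the model can produce its final answer.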

Sequential Batch Processing (Rate-Limit Aware)

When processing multiple queries, send them sequentially with appropriate spacing to stay within rate limits:

import asyncio
from nomyo import SecureChatCompletion

async def batch_processing():
    client = SecureChatCompletion(api_key="your-api-key-here")

    queries = [
        {"role": "user", "content": "Analyze this document section A..."},
        {"role": "user", "content": "Analyze this document section B..."},
        {"role": "user", "content": "Analyze this document section C..."},
    ]

    responses = []
    for i, query in enumerate(queries):
        response = await client.create(
            model="Qwen/Qwen3.5-9B",
            messages=[query],
            security_tier="high"
        )
        responses.append(response)
        # Space requests about 1 second apart to stay within the default 1 req/s limit
        if i < len(queries) - 1:
            await asyncio.sleep(1.0)

    for i, response in enumerate(responses):
        print(f"Query {i+1}: {response['choices'][0]['message']['content'][:100]}...")

asyncio.run(batch_processing())

Why sequential? NOMYO’s rate limit is 1 req/s (default) with a burst of 2 req/s. Sending concurrent requests can trigger rate limiting. Sequential processing with a small delay is the safe, predictable approach.
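
If requests originate from several coroutines rather than one loop, a shared limiter can enforce the same spacing. The following is a minimal sketch using only asyncio; MinIntervalLimiter is a hypothetical helper, not part of the NOMYO client, and the demo uses a short interval so it finishes quickly.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Allow at most one call per `interval` seconds across concurrent tasks."""

    def __init__(self, interval: float):
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            if now < self._next_allowed:
                await asyncio.sleep(self._next_allowed - now)
            # Schedule the next slot relative to when this call was allowed to run
            self._next_allowed = max(now, self._next_allowed) + self.interval


async def _demo(interval=0.05, calls=3):
    limiter = MinIntervalLimiter(interval)  # created inside the running event loop
    stamps = []

    async def worker():
        await limiter.wait()
        stamps.append(time.monotonic())

    await asyncio.gather(*(worker() for _ in range(calls)))
    return stamps


stamps = asyncio.run(_demo())
```

For the default plan you would construct MinIntervalLimiter(1.0) and call await limiter.wait() immediately before each client.create(...).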

Memory Safety — Handling Responses Like a Pro

The NOMYO client library protects all intermediate crypto material (AES keys, raw plaintext bytes) in secure memory and zeros it immediately after use. However, the final parsed response dict is returned to you — and your code is responsible for minimizing how long it lives in memory.

This matters because the response is new data you didn’t have before: a confidential analysis, legal summary, or business-critical output. The longer it lives as a reachable Python object, the larger the exposure window from swap files, core dumps, memory inspection, or garbage collection delays.

# ✅ GOOD — extract what you need, then delete immediately
response = await client.create(
    model="Qwen/Qwen3.5-9B",
    messages=[{"role": "user", "content": sensitive_data}],
    security_tier="high"
)
reply = response["choices"][0]["message"]["content"]
del response  # Drop the full dict immediately

# ... use reply ...
del reply     # Drop when done

# ❌ BAD — holding the full response dict longer than needed
response = await client.create(...)
# ... many lines of unrelated code ...
# response still reachable in memory the entire time
text = response["choices"][0]["message"]["content"]

Note: Python’s del removes the reference and allows the GC to reclaim memory sooner, but does not zero the underlying bytes. For maximum protection (classified data), process the response and discard it as quickly as possible — do not store it in long-lived objects, class attributes, or logs.


Step 8: Verifying Server Attestation (Maximum Tier)

When using the maximum security tier, you can inspect the server’s hardware attestation metadata as evidence that it is running on trusted hardware:

import asyncio
from nomyo import SecureChatCompletion

async def verify_attestation():
    client = SecureChatCompletion(api_key="your-api-key-here")

    response = await client.create(
        model="Qwen/Qwen3.5-9B",
        messages=[{"role": "user", "content": "..."}],
        security_tier="maximum"
    )

    tpm = response["_metadata"].get("tpm_attestation", {})

    if tpm.get("is_available"):
        print("PCR banks:", tpm["pcr_banks"])         # e.g. "sha256:0,7,10"
        print("PCR values:", tpm["pcr_values"])        # {bank: {index: hex}}
        print("AIK key:", tpm["aik_pubkey_b64"][:32], "...")
    else:
        print("TPM not available on this server")

asyncio.run(verify_attestation())
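
Printing the PCR values is only useful if you compare them against values you already trust. The sketch below does that comparison with the standard library. It assumes pcr_values has the {bank: {index: hex}} shape noted in the comment above, with string indexes, and the digests shown are placeholders. It does not verify the quote's signature, so it complements cryptographic verification rather than replacing it.

```python
import hmac


def pcr_mismatches(reported: dict, expected: dict) -> list:
    """Return (bank, index) pairs whose reported value is missing or differs."""
    mismatches = []
    for bank, indexes in expected.items():
        for index, want in indexes.items():
            got = reported.get(bank, {}).get(index)
            if got is None or not hmac.compare_digest(got.lower(), want.lower()):
                mismatches.append((bank, index))
    return mismatches


# Placeholder digests for illustration; real values come from your own trusted baseline
expected = {"sha256": {"0": "aa" * 32, "7": "bb" * 32}}
reported_ok = {"sha256": {"0": "AA" * 32, "7": "bb" * 32, "10": "cc" * 32}}
reported_bad = {"sha256": {"0": "aa" * 32, "7": "00" * 32}}
```

In the function above, pcr_mismatches(tpm["pcr_values"], expected) returning an empty list would mean every PCR you pinned matches; any other result is a reason to discard the response.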

Full verification requires tpm2-pytss (optional but recommended for maximum security deployments):

sudo apt install libtss2-dev   # TSS2 development libraries, needed to build tpm2-pytss
pip install tpm2-pytss

Putting It All Together: A Real-World Example

Here’s a complete production-ready chat application with encrypted inference:

import asyncio
import os
import logging
from nomyo import SecureChatCompletion, AuthenticationError, ServiceUnavailableError

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class SecureLegalAssistant:
    def __init__(self):
        # Load API key from environment variable (never hardcode)
        api_key = os.getenv('NOMYO_API_KEY')
        if not api_key:
            raise ValueError("NOMYO_API_KEY environment variable not set")

        self.client = SecureChatCompletion(
            api_key=api_key,
            secure_memory=True,
            max_retries=2
        )
        self.conversation_history = []

    async def chat(self, user_message: str) -> str:
        """Process a legal query with full E2E encryption."""
        self.conversation_history.append({"role": "user", "content": user_message})

        try:
            response = await self.client.create(
                model="Qwen/Qwen3.5-9B",
                messages=self.conversation_history,
                security_tier="high",
                temperature=0.3
            )

            # Extract and immediately discard response dict
            assistant_message = response["choices"][0]["message"]
            self.conversation_history.append(assistant_message)
            del response  # Minimize memory exposure

            return assistant_message["content"]

        except AuthenticationError as e:
            logger.error(f"Authentication failed: {e}")
            raise
        except ServiceUnavailableError as e:
            logger.error(f"Service unavailable: {e}")
            raise
        except Exception as e:
            logger.error(f"Unexpected error: {e}")
            raise

async def main():
    assistant = SecureLegalAssistant()

    # First query
    response1 = await assistant.chat(
        "Review the liability clause in this contract. "
        "What risks should the client be aware of?"
    )
    print(f"Assistant: {response1}")

    # Second query
    response2 = await assistant.chat(
        "Based on that analysis, what amendments would you recommend?"
    )
    print(f"Assistant: {response2}")

asyncio.run(main())

Pricing

NOMYO offers straightforward, transparent pricing:

Professional — $199/month

  • 1 API key
  • All 15+ models included
  • All security tiers (standard, high, maximum)
  • 2 req/s (4 burst)
  • Professional human support

Enterprise — Custom

  • Unlimited API keys
  • All models + Maximum security tier
  • Custom rate limits
  • Dedicated support + SLA
  • Custom deployments & compliance

Subscribe at chat.nomyo.ai


Why This Matters Now

Three forces are converging to make privacy-first AI not just desirable but essential:

1. Regulatory Pressure

The EU AI Act is now law. HIPAA is being updated for the AI era. California, Texas, and other jurisdictions are introducing AI-specific regulations. Companies that can’t demonstrate data privacy in their AI pipelines will face fines, legal liability, and loss of customer trust.

2. Customer Expectations

Enterprise customers are asking: “How do you handle our data?” If your answer isn’t satisfactory, they’ll take their business elsewhere. Privacy is becoming a competitive differentiator, not just a compliance checkbox.

3. Cost Pressures

Public AI APIs are getting more expensive. As usage scales, so do costs. A privacy-first stack that includes intelligent model routing can reduce AI costs by 40-60% by routing to the most cost-effective model for each task and using smaller models for simpler workloads.


Getting Started: Your Roadmap

You don’t need to rebuild everything overnight. Here’s a practical phased approach:

Phase 1: Assess (Week 1-2)

  • Audit what data is currently flowing to public AI models
  • Identify compliance requirements for your industry
  • Map all AI use cases across your organization
  • Prioritize by sensitivity and volume

Phase 2: Pilot (Week 3-6)

  • Select 1-2 high-sensitivity use cases
  • Implement encrypted inference for those use cases
  • Set up basic monitoring and logging
  • Measure performance and cost impact

Phase 3: Scale (Month 2-3)

  • Expand encrypted inference to additional use cases
  • Implement intelligent model routing
  • Build compliance documentation
  • Train teams on new workflows

Phase 4: Optimize (Month 4-6)

  • Fine-tune model routing for cost and performance
  • Implement custom models where needed
  • Build comprehensive monitoring and alerting
  • Achieve full compliance certification

Conclusion

The companies that win in the AI era won’t be the ones with the most data. They’ll be the ones that can use data most safely.

Privacy-first AI isn’t a constraint — it’s a competitive advantage. And with NOMYO, setting up end-to-end encrypted inference takes less than five minutes. No complex crypto setup. No key management headaches. Just drop in the SecureChatCompletion client, choose your security tier, and start processing sensitive data with the confidence that only you hold the keys.

Ready to build your privacy-first AI stack?


Disclaimer: This is a technical guide for educational purposes. Always consult with your legal and compliance teams before deploying AI systems that process sensitive data. NOMYO provides the encryption infrastructure, but compliance responsibility ultimately rests with the data controller.