This is a hands-on companion to our earlier post on The Privacy-First AI Stack. If you’re looking for the conceptual overview, start there. If you want code, jump in.
Every day, thousands of companies send their most sensitive data to public AI models. Patient records. Legal briefs. Financial projections. Source code. Customer PII.
All of it flows through OpenAI, Anthropic, Google, and other public AI providers, where it is readable by the provider, uncontrolled, and often untracked. For startups building consumer apps, this might be acceptable. For healthcare providers, legal firms, financial institutions, and government contractors? It’s a compliance nightmare waiting to happen.
But here’s the good news: you don’t have to choose between AI capabilities and data privacy.
In this deep dive, I’ll walk you through setting up end-to-end encrypted AI inference using NOMYO AI, a self-funded, privacy-first AI platform that wraps every prompt and response in hybrid AES-256-GCM and RSA-OAEP encryption so your data is never visible to the provider, not even for a millisecond.
What Is End-to-End Encrypted Inference?
Traditional cloud AI works like this:
Your Data → Plaintext → AI Provider → Plaintext Results → Your App
Your data leaves your infrastructure as plain text, travels over the internet (TLS-encrypted, yes, but the provider still sees it after TLS terminates), gets processed, and comes back as plain text. The provider sees everything. They may cache it. Depending on their terms, they may use it for training. You have little control over any of this.
End-to-end encrypted inference changes that equation entirely:
Your Data → Encrypted → AI Processing → Encrypted Results → Your Decrypted Output
At each intermediate stage (Encrypted, AI Processing, Encrypted Results), the data is never seen by the provider.
Your data is encrypted client-side before it ever leaves your server. The AI provider processes ciphertext — they literally cannot see what they’re processing. Results come back encrypted and are decrypted only on your end.
This isn’t theoretical. It’s achievable today with the right architecture. And in this guide, I’ll show you exactly how to set it up with NOMYO.
The NOMYO Architecture: How It Actually Works
NOMYO uses a hybrid encryption scheme that combines the speed of AES-256-GCM for payload encryption with RSA-OAEP (4096-bit) for secure key exchange. Here’s the technical breakdown:
1. Hybrid Encryption
- AES-256-GCM encrypts your actual prompt and response payloads. This is authenticated encryption — it provides both confidentiality and integrity in a single operation.
- RSA-OAEP (4096-bit) handles the key exchange. The server’s public key is fetched once, and your client generates a fresh AES-256 key for each request, encrypts it with RSA, and sends it along with the encrypted payload.
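The hybrid scheme described above can be sketched in a few lines with the `cryptography` package. This is a minimal illustration of the general technique, not NOMYO's actual wire format: the function names and the `wrapped_key` / `nonce` / `ciphertext` field names are assumptions made for this example.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# RSA-OAEP padding with SHA-256 (a common choice; the hash is an assumption here)
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def hybrid_encrypt(server_public_key, plaintext: bytes) -> dict:
    """Encrypt with a fresh AES-256-GCM key, then wrap that key with RSA-OAEP."""
    aes_key = AESGCM.generate_key(bit_length=256)  # one new key per request
    nonce = os.urandom(12)  # 96-bit nonce, the usual size for GCM
    ciphertext = AESGCM(aes_key).encrypt(nonce, plaintext, None)
    wrapped_key = server_public_key.encrypt(aes_key, OAEP)
    return {"wrapped_key": wrapped_key, "nonce": nonce, "ciphertext": ciphertext}


def hybrid_decrypt(server_private_key, envelope: dict) -> bytes:
    """Unwrap the AES key with RSA-OAEP, then decrypt and authenticate the payload."""
    aes_key = server_private_key.decrypt(envelope["wrapped_key"], OAEP)
    return AESGCM(aes_key).decrypt(envelope["nonce"], envelope["ciphertext"], None)
```

To try it locally, generate a throwaway key pair with `rsa.generate_private_key(public_exponent=65537, key_size=4096)` from `cryptography.hazmat.primitives.asymmetric.rsa` and pass its `public_key()` to `hybrid_encrypt`. Because GCM is authenticated, any change to the ciphertext makes `hybrid_decrypt` raise instead of returning corrupted data.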
2. Forward Secrecy
Every single inference gets a unique AES-256 key. Keys are generated via secrets.token_bytes() and zeroed from memory immediately after use. Even if a key is somehow compromised, only a single inference is affected — all other requests remain secure.
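As a rough illustration of the per-request key lifecycle, the sketch below generates a key with `secrets.token_bytes()` and overwrites it when the caller is done. The `ephemeral_key` helper is hypothetical and only best-effort: the temporary `bytes` object returned by `token_bytes()` cannot be overwritten from Python, so a real implementation would need lower-level memory handling.

```python
import secrets
from contextlib import contextmanager


@contextmanager
def ephemeral_key(num_bytes: int = 32):
    """Yield a fresh random key and overwrite it when the block exits."""
    # A bytearray is mutable, so it can be overwritten in place;
    # an immutable bytes object cannot.
    key = bytearray(secrets.token_bytes(num_bytes))
    try:
        yield key
    finally:
        # Best-effort zeroing of the buffer we control
        for i in range(len(key)):
            key[i] = 0
```

Used as `with ephemeral_key() as key: ...`, each request gets its own 32-byte (256-bit) key, and the buffer holds only zeros once the block exits, even if the body raised an exception.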
3. Secure Memory Protection
Plaintext payloads are protected from being swapped to disk. All crypto material is zeroed immediately after encryption. No core dumps. No page files. No memory inspection can recover your data.
4. TPM 2.0 Hardware Attestation (Maximum Tier)
When the server has a TPM 2.0 chip, every response includes a cryptographically signed hardware quote proving:
- Which firmware and Secure Boot state the server is running (PCR 0, 7)
- Which application binary is running (PCR 10, if IMA is active)
- The quote is signed by an ephemeral AIK (Attestation Identity Key) generated fresh for each request and tied to a payload_id nonce, so it cannot be replayed
This is the kind of hardware-level attestation you’d expect in defense applications — now available for AI inference.
Step 1: Installation
Setting up NOMYO takes about 5 minutes. Here’s how:
Prerequisites
- Python 3.7 or higher
- pip (Python package installer)
- A paid subscription at chat.nomyo.ai (required for API access)
Install from PyPI
pip install nomyo
Use a Virtual Environment (Recommended)
# Create virtual environment
python -m venv nomyo_env
# Activate it
source nomyo_env/bin/activate # Linux/Mac
# or
nomyo_env\Scripts\activate # Windows
# Install nomyo
pip install nomyo
Verify Installation
import nomyo
print("NOMYO client installed successfully!")
That’s it. The package automatically installs all dependencies including cryptography, httpx, certifi, and the async compatibility layer.
Step 2: Basic Setup — Your First Encrypted Inference
Let’s start with the simplest possible example. NOMYO provides a SecureChatCompletion class that’s a drop-in replacement for OpenAI’s ChatCompletion API. Same interface. Same parameters. Zero rewrites.
import asyncio
from nomyo import SecureChatCompletion

async def main():
    # Initialize client (defaults to https://api.nomyo.ai)
    client = SecureChatCompletion(api_key="your-api-key-here")

    # Simple chat completion: data is encrypted before it leaves your machine
    response = await client.create(
        model="Qwen/Qwen3-0.6B",
        messages=[
            {"role": "user", "content": "Hello! How are you today?"}
        ],
        temperature=0.7
    )

    # Extract the response, then delete it immediately
    reply = response['choices'][0]['message']['content']
    del response  # Minimize decrypted data lifetime in memory
    print(reply)

asyncio.run(main())
That’s it. The encryption and decryption happen transparently. You don’t manage keys manually. You don’t handle crypto. NOMYO does all of that internally.
With System Messages
import asyncio
from nomyo import SecureChatCompletion

async def main():
    client = SecureChatCompletion(api_key="your-api-key-here")
    response = await client.create(
        model="Qwen/Qwen3-0.6B",
        messages=[
            {"role": "system", "content": "You are a helpful legal assistant."},
            {"role": "user", "content": "Summarize the key clauses in this contract..."}
        ],
        temperature=0.7
    )
    print(response['choices'][0]['message']['content'])

asyncio.run(main())
Step 3: Understanding Security Tiers
NOMYO provides three security tiers that let you match the encryption level to your data sensitivity:
| Tier | What It Includes | Use Case |
|---|---|---|
| standard | Full E2E encryption + secure tokenizer | General secure inference |
| high | Full E2E + secure memory guaranteed | Sensitive business data |
| maximum | High + TPM 2.0 hardware attestation | HIPAA PHI, classified data |
Here’s what each tier guarantees:
- standard: Your prompt and response are encrypted with AES-256-GCM before leaving your machine. A secure tokenizer is enforced, preventing the provider from reconstructing your data from token IDs alone.
- high: Everything in standard, plus secure memory is guaranteed. Plaintext payloads are protected from being swapped to disk, and all crypto material is zeroed immediately after use.
- maximum: Everything in high, plus the server must have a TPM 2.0 chip. Every response includes a cryptographically signed hardware attestation proving the server’s firmware, Secure Boot state, and application binary integrity via PCR measurements. If the server lacks TPM 2.0, the request is rejected.
import asyncio
from nomyo import SecureChatCompletion

async def use_security_tiers():
    client = SecureChatCompletion(api_key="your-api-key-here")

    # Standard: General queries
    response1 = await client.create(
        model="Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "What's the weather today?"}],
        security_tier="standard"
    )

    # High: Sensitive business data
    response2 = await client.create(
        model="Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "What's my bank account balance?"}],
        security_tier="high"
    )

    # Maximum: HIPAA PHI or classified data
    response3 = await client.create(
        model="Qwen/Qwen3.5-9B",
        messages=[{"role": "user", "content": "Share my personal medical records"}],
        security_tier="maximum"
    )

asyncio.run(use_security_tiers())
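In a larger codebase it may help to choose the tier from a data classification rather than hardcoding it at each call site. The sketch below is one possible approach; the sensitivity labels (`public`, `internal`, `regulated`) and the `pick_security_tier` helper are hypothetical and not part of the NOMYO client.

```python
from typing import Optional

# Hypothetical classification labels mapped to the three tiers described above
TIER_BY_SENSITIVITY = {
    "public": "standard",
    "internal": "high",
    "regulated": "maximum",
}


def pick_security_tier(sensitivity: Optional[str]) -> str:
    """Map a data-sensitivity label to a tier, failing closed to "maximum"."""
    if sensitivity is None:
        return "maximum"
    # Unknown or misspelled labels also get the strictest tier
    return TIER_BY_SENSITIVITY.get(sensitivity.strip().lower(), "maximum")
```

The result can be passed straight to `security_tier=` in `client.create(...)`. Failing closed means an unlabeled record gets the strictest tier instead of silently falling back to a weaker one.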
Step 4: Key Management — Ephemeral Keys by Default
By default, NOMYO uses ephemeral keys — a fresh key pair is generated for every session and zeroed from memory when the session ends. This is the recommended approach for most use cases, as it provides maximum security with zero operational overhead.
Default: Ephemeral Keys (Recommended)
import asyncio
from nomyo import SecureChatCompletion

async def ephemeral_keys():
    # Ephemeral keys are the default, so no setup is needed
    client = SecureChatCompletion(api_key="your-api-key-here")

    # Each session gets a fresh key pair.
    # When the client is garbage collected, the keys are zeroed.
    response = await client.create(
        model="Qwen/Qwen3.5-9B",
        messages=[{"role": "user", "content": "Analyze this document..."}],
        security_tier="high"
    )
    print(response['choices'][0]['message']['content'])

asyncio.run(ephemeral_keys())
Why ephemeral keys are the best practice:
- No persistent keys on disk — even if an attacker gains filesystem access, there’s nothing to steal.
- Automatic cleanup — when the session ends, the key material is zeroed from memory.
- Forward secrecy — each session is independent. Compromising one session reveals nothing about others.
- Zero operational overhead — no password management, no key rotation schedules, no secret vaults.
Optional: Persistent Keys (For Long-Lived Services)
If you’re running a long-lived service and want to persist keys across restarts, you can opt in:
import asyncio
from nomyo import SecureChatCompletion

async def persistent_keys():
    client = SecureChatCompletion(api_key="your-api-key-here")

    # Generate keys with password protection and save to disk
    await client.generate_keys(
        save_to_file=True,
        key_dir="client_keys",
        password="your-strong-password-here"
    )

    # Load previously saved keys on subsequent runs
    await client.load_keys(
        "client_keys/private_key.pem",
        "client_keys/public_key.pem",
        password="your-strong-password-here"
    )
    # ... use as normal ...

asyncio.run(persistent_keys())
When to use persistent keys:
- Long-running daemon processes where regenerating keys on every restart is undesirable
- Multi-process architectures that share a key pair
- Compliance requirements that mandate key persistence with specific rotation policies
Production best practices for persistent keys:
- Always use password protection for private keys
- Keep private key file permissions at 600 (owner-only access)
- Never share your private key
- Verify the server’s public key fingerprint before first use
- Use HTTPS connections only (never HTTP in production)
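The file-permission item in the list above is easy to check at startup. The following sketch uses only the standard library and assumes a POSIX system (Windows does not use the same permission bits); the helper names are made up for this example.

```python
import os
import stat


def is_owner_only(path: str) -> bool:
    """Return True if the file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def restrict_to_owner(path: str) -> None:
    """Set the file mode to 600 (read/write for the owner only)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

A service could call `is_owner_only("client_keys/private_key.pem")` before loading keys and refuse to start, or call `restrict_to_owner(...)`, when the check fails.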
Step 5: Working with Models
NOMYO offers 15+ open-source models, all available with E2E encryption by default. Here’s the current lineup:
| Model | Parameters | Best For |
|---|---|---|
| Qwen/Qwen3-0.6B | 0.6B | Low latency, edge use |
| Qwen/Qwen3.5-0.8B | 0.8B | Lightweight fast inference |
| LiquidAI/LFM2.5-1.2B-Thinking | 1.2B | Reasoning, chain-of-thought |
| Qwen/Qwen3.5-9B | 9B | Balanced quality and speed |
| utter-project/EuroLLM-9B-Instruct-2512 | 9B | Multilingual (European languages) |
| ServiceNow-AI/Apriel-1.6-15b-Thinker | 15B | Math, physics, science reasoning |
| openai/gpt-oss-20b | 20B | General purpose |
| Qwen/Qwen3.5-27B | 27B | High quality, large context |
| Qwen/Qwen3.5-35B-A3B | 35B (3B active) | Highest quality general model |
| google/medgemma-27b-it | 27B | Medical domain (additional TOU required) |
| moonshotai/Kimi-Linear-48B-A3B-Instruct | 48B (3B active) | Largest capacity, 1M context |
Note: MoE (Mixture of Experts) models show total/active parameter counts. Only the active parameters are used to compute each token, which makes inference much cheaper than for a dense model of the same total size.
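To make the total/active distinction concrete, the short calculation below uses the parameter counts from the table. It is a back-of-the-envelope illustration only: the full set of weights still has to be held in memory, so the saving applies to compute per token, not to model size.

```python
def active_fraction(total_billions: float, active_billions: float) -> float:
    """Fraction of a MoE model's parameters used to compute each token."""
    return active_billions / total_billions


# Parameter counts taken from the model table above
moe_models = [
    ("Qwen/Qwen3.5-35B-A3B", 35, 3),
    ("moonshotai/Kimi-Linear-48B-A3B-Instruct", 48, 3),
]

for name, total, active in moe_models:
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
```

Both models activate well under a tenth of their weights for each token, which is why their per-token compute is closer to that of a 3B dense model than to a 35B or 48B one.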
Note: The google/medgemma-27b-it medical model requires acceptance of additional Terms of Use before it can be accessed.
Choosing the Right Model
import asyncio
from nomyo import SecureChatCompletion

async def model_selection():
    client = SecureChatCompletion(api_key="your-api-key-here")

    # Placeholder inputs; replace with your own data
    contract_text = "Full text of the contract to analyze..."
    complex_query = "A question that needs careful, multi-step reasoning..."
    simple_query = "A short, simple question..."

    # Contract analysis: high sensitivity
    legal_response = await client.create(
        model="Qwen/Qwen3.5-9B",
        messages=[{"role": "user", "content": contract_text}],
        security_tier="high"
    )

    # Highest quality general model
    general_response = await client.create(
        model="Qwen/Qwen3.5-35B-A3B",
        messages=[{"role": "user", "content": complex_query}],
        security_tier="high"
    )

    # Low-latency edge use
    edge_response = await client.create(
        model="Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": simple_query}],
        security_tier="standard"
    )

asyncio.run(model_selection())
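If many call sites need to pick a model, a small lookup keeps the choice in one place. The sketch below is an assumption about how you might organize this: the task labels and the `choose_model` helper are hypothetical, while the model names come from the table above.

```python
# Hypothetical task labels mapped to models from the lineup table
MODEL_BY_TASK = {
    "edge": "Qwen/Qwen3-0.6B",
    "reasoning": "LiquidAI/LFM2.5-1.2B-Thinking",
    "multilingual": "utter-project/EuroLLM-9B-Instruct-2512",
    "science": "ServiceNow-AI/Apriel-1.6-15b-Thinker",
    "complex": "Qwen/Qwen3.5-35B-A3B",
}

# The table lists this model as "Balanced quality and speed"
DEFAULT_MODEL = "Qwen/Qwen3.5-9B"


def choose_model(task: str) -> str:
    """Return the model for a task label, falling back to the balanced default."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)
```

The returned name can be passed as `model=` to `client.create(...)`; unknown task labels fall back to the balanced 9B model.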
Step 6: Error Handling and Production Patterns
Real applications need robust error handling. NOMYO provides specific exception classes:
import asyncio
from nomyo import SecureChatCompletion, AuthenticationError, InvalidRequestError, RateLimitError, ServiceUnavailableError

async def robust_inference():
    client = SecureChatCompletion(
        api_key="your-api-key-here",
        max_retries=2  # Automatic retry on 429, 500, 502, 503, 504
    )
    sensitive_data = "Confidential text to analyze..."  # Placeholder input

    try:
        response = await client.create(
            model="Qwen/Qwen3.5-9B",
            messages=[{"role": "user", "content": sensitive_data}],
            security_tier="high"
        )
        print(response['choices'][0]['message']['content'])
    except AuthenticationError as e:
        print(f"Authentication failed: {e}")
        # Check your API key
    except InvalidRequestError as e:
        print(f"Invalid request: {e}")
        # Check model name, message format, parameters
    except RateLimitError as e:
        print(f"Rate limit exceeded: {e}")
        # Implement backoff
    except ServiceUnavailableError as e:
        print(f"Service unavailable: {e}")
        # For maximum tier: server may lack TPM 2.0
    except Exception as e:
        print(f"Unexpected error: {e}")

asyncio.run(robust_inference())
Rate Limits
The NOMYO API enforces rate limits to ensure fair usage:
- Default: 1 request/second
- Burst: Up to 2 requests/second (twice per 10-second window)
- Professional plan: 2 req/s with 4 burst
- Abuse protection: Repeated burst abuse triggers a 30-minute cool-down
Implement exponential backoff in your application:
import asyncio
from nomyo import RateLimitError

async def request_with_backoff(client, messages, max_retries=5):
    delay = 0.5
    for attempt in range(max_retries):
        try:
            response = await client.create(
                model="Qwen/Qwen3.5-9B",
                messages=messages
            )
            return response
        except RateLimitError:
            # Wait, then double the delay for the next attempt (capped at 30 s)
            await asyncio.sleep(delay)
            delay = min(delay * 2, 30)
    raise RuntimeError("Rate limit exceeded after maximum retries")
Step 7: Advanced Usage — Tools, Sequential Batching, and Memory Safety
Using Tool Calling
NOMYO supports OpenAI-compatible tool calling, fully encrypted:
import asyncio
from nomyo import SecureChatCompletion

async def chat_with_tools():
    client = SecureChatCompletion(api_key="your-api-key-here")
    response = await client.create(
        model="Qwen/Qwen3.5-9B",
        messages=[
            {"role": "user", "content": "Calculate the mean of: 100, 200, 300, 400"}
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "calculate_statistics",
                    "description": "Calculate statistical measures",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "data": {"type": "array", "items": {"type": "number"}}
                        },
                        "required": ["data"]
                    }
                }
            }
        ]
    )
    message = response['choices'][0]['message']
    # Assuming the OpenAI-compatible shape: when the model calls a tool,
    # content may be empty and the call appears under "tool_calls"
    if message.get('tool_calls'):
        print(message['tool_calls'])
    else:
        print(message['content'])

asyncio.run(chat_with_tools())
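Once the model asks for a tool, your code has to run it and send the result back. The sketch below shows one way to dispatch such calls locally. It assumes the response follows the OpenAI tool-calling shape (a `tool_calls` list whose entries carry an `id` and a `function` with `name` and JSON-encoded `arguments`); the `TOOL_REGISTRY` and `run_tool_calls` names are hypothetical.

```python
import json
import statistics


def calculate_statistics(data):
    """Local implementation of the tool declared in the request."""
    return {"mean": statistics.mean(data), "median": statistics.median(data)}


# Maps tool names the model may request to local Python callables
TOOL_REGISTRY = {"calculate_statistics": calculate_statistics}


def run_tool_calls(message: dict) -> list:
    """Execute each requested tool and return tool-role messages with the results."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        handler = TOOL_REGISTRY.get(fn["name"])
        if handler is None:
            output = {"error": f"unknown tool {fn['name']!r}"}
        else:
            args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
            output = handler(**args)
        results.append(
            {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(output)}
        )
    return results
```

The returned messages would be appended to the conversation, after the assistant message that requested them, and sent in a follow-up `client.create(...)` call so the model can produce its final answer.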
Sequential Batch Processing (Rate-Limit Aware)
When processing multiple queries, send them sequentially with appropriate spacing to stay within rate limits:
import asyncio
from nomyo import SecureChatCompletion

async def batch_processing():
    client = SecureChatCompletion(api_key="your-api-key-here")
    queries = [
        {"role": "user", "content": "Analyze this document section A..."},
        {"role": "user", "content": "Analyze this document section B..."},
        {"role": "user", "content": "Analyze this document section C..."},
    ]
    responses = []
    for i, query in enumerate(queries):
        response = await client.create(
            model="Qwen/Qwen3.5-9B",
            messages=[query],
            security_tier="high"
        )
        responses.append(response)
        # Small delay between requests to stay within rate limits
        if i < len(queries) - 1:
            await asyncio.sleep(0.6)

    for i, response in enumerate(responses):
        print(f"Query {i+1}: {response['choices'][0]['message']['content'][:100]}...")

asyncio.run(batch_processing())
Why sequential? NOMYO’s rate limit is 1 req/s (default) with a burst of 2 req/s. Sending concurrent requests can trigger rate limiting. Sequential processing with a small delay is the safe, predictable approach.
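A fixed sleep works for a single loop, but if several parts of an application share one API key, a small shared limiter can enforce the spacing in one place. The class below is a sketch under that assumption; `MinIntervalLimiter` is not part of the NOMYO client, and the injectable `clock` and `sleep` parameters exist only to make it easy to test.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Ensure at least `interval` seconds between consecutive request starts."""

    def __init__(self, interval: float, clock=time.monotonic, sleep=asyncio.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0
        self._lock = None  # created lazily, inside the running event loop

    async def wait(self) -> None:
        if self._lock is None:
            self._lock = asyncio.Lock()
        # The lock makes concurrent callers queue up one behind the other
        async with self._lock:
            delay = self._next_allowed - self._clock()
            if delay > 0:
                await self._sleep(delay)
            self._next_allowed = max(self._clock(), self._next_allowed) + self.interval
```

With `limiter = MinIntervalLimiter(1.0)`, calling `await limiter.wait()` before each `client.create(...)` keeps request starts at least one second apart, matching the default limit of 1 request/second, no matter how many coroutines share the limiter.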
Memory Safety — Handling Responses Like a Pro
The NOMYO client library protects all intermediate crypto material (AES keys, raw plaintext bytes) in secure memory and zeros it immediately after use. However, the final parsed response dict is returned to you — and your code is responsible for minimizing how long it lives in memory.
This matters because the response is new data you didn’t have before: a confidential analysis, legal summary, or business-critical output. The longer it lives as a reachable Python object, the larger the exposure window from swap files, core dumps, memory inspection, or garbage collection delays.
# ✅ GOOD: extract what you need, then delete immediately
response = await client.create(
    model="Qwen/Qwen3.5-9B",
    messages=[{"role": "user", "content": sensitive_data}],
    security_tier="high"
)
reply = response["choices"][0]["message"]["content"]
del response  # Drop the full dict immediately
# ... use reply ...
del reply  # Drop when done

# ❌ BAD: holding the full response dict longer than needed
response = await client.create(...)
# ... many lines of unrelated code ...
# response still reachable in memory the entire time
text = response["choices"][0]["message"]["content"]
Note: Python’s del removes the reference and allows the GC to reclaim memory sooner, but does not zero the underlying bytes. For maximum protection (classified data), process the response and discard it as quickly as possible; do not store it in long-lived objects, class attributes, or logs.
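One limitation of `del` is that it removes only the name you delete; any other variable pointing at the same dict keeps it alive. A small helper that empties the dict in place sidesteps that. The sketch below is illustrative: `take_reply` is a hypothetical helper, and like `del` it does not zero the underlying bytes.

```python
def take_reply(response: dict) -> str:
    """Return the assistant text and empty the response dict in place."""
    reply = response["choices"][0]["message"]["content"]
    # clear() mutates the dict itself, so every holder of this dict
    # now sees an empty dict, not just the local name
    response.clear()
    return reply
```

Calling `reply = take_reply(response)` right after `client.create(...)` leaves the reply string as the only remaining copy of the output that your code holds, provided nothing else kept a reference to the nested message objects.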
Step 8: Verifying Server Attestation (Maximum Tier)
When using the maximum security tier, you can verify the server’s hardware attestation to prove it’s running on trusted hardware:
import asyncio
from nomyo import SecureChatCompletion

async def verify_attestation():
    client = SecureChatCompletion(api_key="your-api-key-here")
    response = await client.create(
        model="Qwen/Qwen3.5-9B",
        messages=[{"role": "user", "content": "..."}],
        security_tier="maximum"
    )
    tpm = response["_metadata"].get("tpm_attestation", {})
    if tpm.get("is_available"):
        print("PCR banks:", tpm["pcr_banks"])    # e.g. "sha256:0,7,10"
        print("PCR values:", tpm["pcr_values"])  # {bank: {index: hex}}
        print("AIK key:", tpm["aik_pubkey_b64"][:32], "...")
    else:
        print("TPM not available on this server")

asyncio.run(verify_attestation())
Full verification requires tpm2-pytss (optional but recommended for maximum security deployments):
sudo apt install libtss2-dev  # Debian/Ubuntu: TSS2 headers, needed to build tpm2-pytss
pip install tpm2-pytss
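Short of full quote verification, a useful first check is to compare the reported PCR values against values you recorded from a known-good state. The sketch below assumes the `pcr_values` layout shown in the comment above (`{bank: {index: hex}}`) and that indices may arrive as strings or integers; `pcr_mismatches` is a hypothetical helper. It does not verify the quote's signature, so it complements cryptographic verification and does not replace it.

```python
import hmac


def pcr_mismatches(pcr_values: dict, expected: dict) -> list:
    """Return (bank, index) pairs whose reported value differs from the expected one."""
    mismatches = []
    for bank, indices in expected.items():
        # Normalize index keys to strings so 7 and "7" are treated alike
        reported = {str(k): v for k, v in pcr_values.get(bank, {}).items()}
        for index, want in indices.items():
            got = reported.get(str(index))
            # Missing values count as mismatches; hex comparison ignores case
            if got is None or not hmac.compare_digest(got.lower(), want.lower()):
                mismatches.append((bank, str(index)))
    return mismatches
```

An empty list means every expected PCR matched. Anything else is a reason to treat the response as unverified.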
Putting It All Together: A Real-World Example
Here’s a complete production-ready chat application with encrypted inference:
import asyncio
import os
import logging
from nomyo import SecureChatCompletion, AuthenticationError, ServiceUnavailableError

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class SecureLegalAssistant:
    def __init__(self):
        # Load API key from environment variable (never hardcode)
        api_key = os.getenv('NOMYO_API_KEY')
        if not api_key:
            raise ValueError("NOMYO_API_KEY environment variable not set")
        self.client = SecureChatCompletion(
            api_key=api_key,
            secure_memory=True,
            max_retries=2
        )
        self.conversation_history = []

    async def chat(self, user_message: str) -> str:
        """Process a legal query with full E2E encryption."""
        self.conversation_history.append({"role": "user", "content": user_message})
        try:
            response = await self.client.create(
                model="Qwen/Qwen3.5-9B",
                messages=self.conversation_history,
                security_tier="high",
                temperature=0.3
            )
            # Extract and immediately discard response dict
            assistant_message = response["choices"][0]["message"]
            self.conversation_history.append(assistant_message)
            del response  # Minimize memory exposure
            return assistant_message["content"]
        except AuthenticationError as e:
            logger.error(f"Authentication failed: {e}")
            raise
        except ServiceUnavailableError as e:
            # This example uses the "high" tier, so this is not a TPM failure
            logger.error(f"Service unavailable: {e}")
            raise
        except Exception as e:
            logger.error(f"Unexpected error: {e}")
            raise

async def main():
    assistant = SecureLegalAssistant()

    # First query
    response1 = await assistant.chat(
        "Review the liability clause in this contract. "
        "What risks should the client be aware of?"
    )
    print(f"Assistant: {response1}")

    # Second query
    response2 = await assistant.chat(
        "Based on that analysis, what amendments would you recommend?"
    )
    print(f"Assistant: {response2}")

asyncio.run(main())
Pricing
NOMYO offers straightforward, transparent pricing:
Professional — $199/month
- 1 API key
- All 15+ models included
- All security tiers (standard, high, maximum)
- 2 req/s (4 burst)
- Professional human support
Enterprise — Custom
- Unlimited API keys
- All models + Maximum security tier
- Custom rate limits
- Dedicated support + SLA
- Custom deployments & compliance
Why This Matters Now
Three forces are converging to make privacy-first AI not just desirable but essential:
1. Regulatory Pressure
The EU AI Act is now law. HIPAA is being updated for the AI era. California, Texas, and other jurisdictions are introducing AI-specific regulations. Companies that can’t demonstrate data privacy in their AI pipelines will face fines, legal liability, and loss of customer trust.
2. Customer Expectations
Enterprise customers are asking: “How do you handle our data?” If your answer isn’t satisfactory, they’ll take their business elsewhere. Privacy is becoming a competitive differentiator, not just a compliance checkbox.
3. Cost Pressures
Public AI APIs are getting more expensive. As usage scales, so do costs. A privacy-first stack that includes intelligent model routing can reduce AI costs by 40-60% by routing to the most cost-effective model for each task and using smaller models for simpler workloads.
Getting Started: Your Roadmap
You don’t need to rebuild everything overnight. Here’s a practical phased approach:
Phase 1: Assess (Week 1-2)
- Audit what data is currently flowing to public AI models
- Identify compliance requirements for your industry
- Map all AI use cases across your organization
- Prioritize by sensitivity and volume
Phase 2: Pilot (Week 3-6)
- Select 1-2 high-sensitivity use cases
- Implement encrypted inference for those use cases
- Set up basic monitoring and logging
- Measure performance and cost impact
Phase 3: Scale (Month 2-3)
- Expand encrypted inference to additional use cases
- Implement intelligent model routing
- Build compliance documentation
- Train teams on new workflows
Phase 4: Optimize (Month 4-6)
- Fine-tune model routing for cost and performance
- Implement custom models where needed
- Build comprehensive monitoring and alerting
- Achieve full compliance certification
Conclusion
The companies that win in the AI era won’t be the ones with the most data. They’ll be the ones that can use data most safely.
Privacy-first AI isn’t a constraint — it’s a competitive advantage. And with NOMYO, setting up end-to-end encrypted inference takes less than five minutes. No complex crypto setup. No key management headaches. Just drop in the SecureChatCompletion client, choose your security tier, and start processing sensitive data with the confidence that only you hold the keys.
Ready to build your privacy-first AI stack?
- Try e2ee.nomyo.ai for encrypted inference
- Explore nomyo.ai for full platform capabilities
- Subscribe at chat.nomyo.ai
- Contact sales for enterprise deployments
Disclaimer: This is a technical guide for educational purposes. Always consult with your legal and compliance teams before deploying AI systems that process sensitive data. NOMYO provides the encryption infrastructure, but compliance responsibility ultimately rests with the data controller.