🛡️ Safety & Alignment
Prompt injection defenses, jailbreak resistance, hallucination mitigation, PII handling, red-teaming, responsible scaling.
The workflow
flowchart LR
A[Threat model] --> B{Risk type}
B -->|Prompt injection| C[Input sanitization<br/>+ instruction hierarchy]
B -->|Jailbreak| D[Red-team set<br/>refusal training]
B -->|Hallucination| E[Citation +<br/>retrieval grounding]
B -->|PII leak| F[Output filter<br/>+ redaction]
C --> G[Continuous eval<br/>+ monitoring]
D --> G
E --> G
F --> G
Prompt injection is the #1 unsolved AI security issue. Every shipping team needs a layered defense.
Key takeaways
Videos (19)
Prompt Engineering and AI Red Teaming — Sander Schulhoff, HackAPrompt/LearnPrompting
Prompt injection is a fundamentally unsolved AI-security problem and red-teaming competitions like HackAPrompt are the empirical baseline for measuring it.
How to Secure Agents using OAuth — Jared Hanson (Keycard, Passport.js)
Stop pasting long-lived API keys into MCP configs — treat MCP servers as OAuth resource servers and let a real authorization server mediate agent access.
$1 AI Guardrails: The Unreasonable Effectiveness of Finetuned ModernBERTs – Diego Carpentero
A fine-tuned ModernBERT classifier sidecar provides low-latency, sub-dollar guardrails against today's broad LLM attack surface across prompt, context, RAG, and MCP vectors.
Trust, but Verify: Shreya Rajpal
Treat every LLM call as untrusted and wrap it in a verification suite with reask/fix/refrain policies so correctness becomes a programmable property, not a hope.
Why you should care about AI interpretability - Mark Bissell, Goodfire AI
Interpretability tools like Goodfire's Ember are ready to use today — they let AI engineers debug and steer models at the neuron level instead of fighting prompts.
The Unofficial Guide to Apple's Private Cloud Compute - Jmo, CONFSEC
PCC's stack — OHTTP, blind signatures, secure enclaves, code attestation — is a reusable blueprint for making remote AI compute provably private.
How to Build Trustworthy AI — Allie Howe
Trustworthy AI requires shifting right with runtime guardrails alongside build-time scanning and red-teaming because non-determinism means CI/CD alone can't catch failures.
OpenAI on Securing Code-Executing AI Agents — Fouad Matin (Codex, Agent Robustness)
Secure code-executing agents require sandboxing, internet allowlists, and human approval surfaces — Codex CLI is OpenAI's open reference implementation.
How we hacked YC Spring 2025 batch's AI agents — Rene Brandel, Casco
Agents are users, not services — apply classic web-app authn/authz, IDOR mitigation and sandbox hardening downstream of the LLM, not at the prompt layer.
How to defend your sites from AI bots — David Mytton, Arcjet
Layered defenses (robots.txt + UA + reverse-DNS verification + behavioral signals) are needed to distinguish good AI crawlers from training/abuse bots and agent-driven browsers.
CIAM for AI: Authn/Authz for Agents — Michael Grinich, CEO of WorkOS
Agents need a new identity class — neither human nor service-account — with patterns like shadow personas, delegation chains, and capability tokens to keep them safe in enterprise systems.
AI + Security & Safety — Don Bosco Durai
Single-process agent frameworks violate zero-trust — credentials, prompts, and tool outputs need isolation boundaries to prevent injection-based privilege escalation.
LLM Safeguards: Security Privacy Compliance Anti Hallucination: Daniel Whitenack
Production LLM safety requires a layered checklist covering hallucination, supply chain, server resilience, PII leakage, and prompt injection — not a single guardrail.
Securing Agents with Open Standards — Bobby Tiernay and Kam Sween, Auth0
Use open identity standards (OAuth 2.1, token exchange, CIBA, FGA at retrieval) to keep agent actions tied to real users with scoped short-lived tokens, not env-var keys.
AI Frontiers in Trust and Safety Combatting Multifaceted Harm on Tinder at Scale: Vibhor Kumar
Fine-tuning open-source LLMs on hybrid LLM-mined + human-verified data is the practical playbook for trust-and-safety classification at consumer scale.
AI Red Teaming Agent: Azure AI Foundry — Nagkumar Arkalgud & Keiji Kanazawa, Microsoft
Azure AI Foundry packages Microsoft's PyRIT red-teaming toolkit as a managed SDK + dashboard so AI engineers can red-team their own apps without standing up the framework themselves.
Critical AI Inference your CIO can Trust — Sahil Yadav, Hariharan Ganesan, Telemetrak
For CIOs to trust AI in critical systems, MLOps needs to evolve into XTOps with explainability, traceability and guardrails plus business-facing metrics like MTRE and trust-adjusted risk.
Building security around ML: Dr. Andrew Davis
ML security has moved from anti-malware-style ML to defending the ML pipeline itself — poisoning, theft, and adversarial examples remain open problems.
Cognitive Shield Real Time Real Smart - Rachna Srivastava
AI-powered fraud requires AI-powered defenses—layered detection (data hygiene + multi-modal real-time + human escalation) rebuilds trust against deepfakes and synthetic identities.