
🛡️ Safety & Alignment

Prompt injection defenses, jailbreak resistance, hallucination mitigation, PII handling, red-teaming, responsible scaling.

19 videos · security · safety · red-teaming · prompt-injection · oauth · guardrails

The workflow

flowchart LR
    A[Threat model] --> B{Risk type}
    B -->|Prompt injection| C[Input sanitization<br/>+ instruction hierarchy]
    B -->|Jailbreak| D[Red-team set<br/>refusal training]
    B -->|Hallucination| E[Citation +<br/>retrieval grounding]
    B -->|PII leak| F[Output filter<br/>+ redaction]
    C --> G[Continuous eval<br/>+ monitoring]
    D --> G
    E --> G
    F --> G

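Prompt injection remains an unsolved AI security problem (it is listed first, as LLM01, in the OWASP Top 10 for LLM Applications), so every shipping team needs layered defenses rather than a single filter.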
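
As one concrete layer, the "Output filter + redaction" branch of the flowchart could be sketched as below. This is a minimal illustration, not a complete PII detector: the two regular expressions and the `redact_pii` name are assumptions made for this example, and real deployments need far broader coverage (names, addresses, phone numbers, locale-specific IDs).

```python
import re

# Illustrative patterns only (assumptions for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace email addresses and US SSN-shaped numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = US_SSN_RE.sub("[SSN]", text)
    return text


if __name__ == "__main__":
    print(redact_pii("Contact jane.doe@example.com, SSN 123-45-6789."))
```

Running the filter on model output, rather than only on input, means it still applies when the sensitive text came from retrieval or a tool call.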

Key takeaways

Prompt injection is a fundamentally unsolved AI-security problem and red-teaming competitions like HackAPrompt are the empirical baseline for measuring it.
Stop pasting long-lived API keys into MCP configs — treat MCP servers as OAuth resource servers and let a real authorization server mediate agent access.
A fine-tuned ModernBERT classifier sidecar provides low-latency, sub-dollar guardrails against today's broad LLM attack surface across prompt, context, RAG, and MCP vectors.
Treat every LLM call as untrusted and wrap it in a verification suite with reask/fix/refrain policies so correctness becomes a programmable property, not a hope.
Interpretability tools like Goodfire's Ember are ready to use today — they let AI engineers debug and steer models at the neuron level instead of fighting prompts.
PCC's stack — OHTTP, blind signatures, secure enclaves, code attestation — is a reusable blueprint for making remote AI compute provably private.

Videos (19)

Prompt Engineering and AI Red Teaming — Sander Schulhoff, HackAPrompt/LearnPrompting

Prompt injection is a fundamentally unsolved AI-security problem and red-teaming competitions like HackAPrompt are the empirical baseline for measuring it.

12.9K views · Jul 14, 2025

How to Secure Agents using OAuth — Jared Hanson (Keycard, Passport.js)

Stop pasting long-lived API keys into MCP configs — treat MCP servers as OAuth resource servers and let a real authorization server mediate agent access.

7.9K views · Jul 30, 2025
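
A rough sketch of what "MCP server as OAuth resource server" means at request time is shown below. It assumes the access token's signature has already been verified by a real JWT library and that `claims` is the decoded payload. The `authorize_tool_call` helper, the scope names, and the audience URL are hypothetical, while `exp`, `aud`, and a space-delimited `scope` are the standard claim names.

```python
import time
from typing import Any, Mapping, Optional


def authorize_tool_call(
    claims: Mapping[str, Any],
    required_scope: str,
    audience: str,
    now: Optional[float] = None,
) -> bool:
    """Allow a tool call only for an unexpired token issued for this server
    that carries the scope the tool requires."""
    now = time.time() if now is None else now

    # Short-lived tokens: reject anything at or past its expiry.
    if claims.get("exp", 0) <= now:
        return False

    # The token must have been minted for this resource server.
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        return False

    # Scopes arrive as a single space-delimited string.
    granted = set(str(claims.get("scope", "")).split())
    return required_scope in granted
```

Unlike a long-lived API key in a config file, a token checked this way expires on its own and only unlocks the tools its scopes name.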

$1 AI Guardrails: The Unreasonable Effectiveness of Finetuned ModernBERTs – Diego Carpentero

A fine-tuned ModernBERT classifier sidecar provides low-latency, sub-dollar guardrails against today's broad LLM attack surface across prompt, context, RAG, and MCP vectors.

6.3K views · Apr 16, 2026

Trust, but Verify: Shreya Rajpal

Treat every LLM call as untrusted and wrap it in a verification suite with reask/fix/refrain policies so correctness becomes a programmable property, not a hope.

4.7K views · Nov 25, 2023
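
The reask/fix/refrain idea can be sketched as a small wrapper around any LLM callable. This is an illustrative sketch under stated assumptions, not the API of any particular guardrails library: `verified_call`, `validate`, and `fix` are hypothetical names, `validate` returns `None` on success or an error message on failure, and "refrain" is modeled as returning `None`.

```python
from typing import Callable, Optional


def verified_call(
    llm: Callable[[str], str],
    prompt: str,
    validate: Callable[[str], Optional[str]],
    fix: Optional[Callable[[str], str]] = None,
    max_reasks: int = 1,
) -> Optional[str]:
    """Return a validated output, or None if no policy produces one."""
    output = llm(prompt)
    error = validate(output)

    # Policy 1: reask, feeding the validation error back to the model.
    reasks = 0
    while error is not None and reasks < max_reasks:
        output = llm(
            f"{prompt}\n\nYour previous answer was rejected: {error}\n"
            "Please answer again."
        )
        error = validate(output)
        reasks += 1
    if error is None:
        return output

    # Policy 2: fix, a deterministic repair that must itself pass validation.
    if fix is not None:
        repaired = fix(output)
        if validate(repaired) is None:
            return repaired

    # Policy 3: refrain, returning nothing rather than an unverified answer.
    return None
```

The caller then handles `None` explicitly (for example with a fallback message), which is what makes the failure path programmable.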

Why you should care about AI interpretability - Mark Bissell, Goodfire AI

Interpretability tools like Goodfire's Ember are ready to use today — they let AI engineers debug and steer models at the neuron level instead of fighting prompts.

3.9K views · Jul 27, 2025

The Unofficial Guide to Apple's Private Cloud Compute - Jmo, CONFSEC

PCC's stack — OHTTP, blind signatures, secure enclaves, code attestation — is a reusable blueprint for making remote AI compute provably private.

3.2K views · Jul 30, 2025

How to Build Trustworthy AI — Allie Howe

Trustworthy AI requires shifting right with runtime guardrails alongside build-time scanning and red-teaming because non-determinism means CI/CD alone can't catch failures.

3.0K views · Jun 16, 2025

OpenAI on Securing Code-Executing AI Agents — Fouad Matin (Codex, Agent Robustness)

Secure code-executing agents require sandboxing, internet allowlists, and human approval surfaces — Codex CLI is OpenAI's open reference implementation.

2.8K views · Jul 30, 2025
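
An internet allowlist of the kind mentioned above might look roughly like this. It is a generic sketch, not how Codex CLI implements it: the `ALLOWED_HOSTS` entries and the `is_allowed` helper are assumptions for illustration, and the check deliberately uses exact hostname matches over HTTPS only.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist for an agent that only needs to install packages.
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org"})


def is_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Permit only HTTPS requests whose parsed hostname is on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # Compare the parsed hostname, not a substring of the raw URL, so that
    # "pypi.org.evil.example" or "?u=pypi.org" cannot slip through.
    host = (parts.hostname or "").lower()
    return host in allowed_hosts
```

Anything that fails the check is a natural candidate for the human approval surface rather than a silent denial.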

How we hacked YC Spring 2025 batch's AI agents — Rene Brandel, Casco

Agents are users, not services — apply classic web-app authn/authz, IDOR mitigation and sandbox hardening downstream of the LLM, not at the prompt layer.

2.5K views · Jul 30, 2025
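
To make "downstream of the LLM" concrete, here is a minimal sketch of an IDOR check inside a tool handler. The `get_invoice` tool, the in-memory `invoices` store, and the `Forbidden` exception are hypothetical. The key assumption is that `acting_user_id` comes from the authenticated session, never from model output.

```python
from typing import Any, Dict


class Forbidden(Exception):
    """Raised when the acting user may not access the requested object."""


def get_invoice(
    invoices: Dict[str, Dict[str, Any]],
    invoice_id: str,
    acting_user_id: str,
) -> Dict[str, Any]:
    """Return an invoice only if it belongs to the authenticated user.

    invoice_id may be chosen by the model (and so by an attacker via prompt
    injection). acting_user_id must come from the session instead.
    """
    invoice = invoices.get(invoice_id)
    # Same error for "missing" and "not yours" so IDs cannot be enumerated.
    if invoice is None or invoice["owner_id"] != acting_user_id:
        raise Forbidden(invoice_id)
    return invoice
```

Because the check lives in the handler, it holds no matter what the prompt or the model says.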

How to defend your sites from AI bots — David Mytton, Arcjet

Layered defenses (robots.txt + UA + reverse-DNS verification + behavioral signals) are needed to distinguish good AI crawlers from training/abuse bots and agent-driven browsers.

2.0K views · Jul 30, 2025
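
The reverse-DNS layer is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's domain, then resolve that hostname back and confirm it includes the original IP. The sketch below is IPv4-only and makes some assumptions: `verify_crawler_ip` is a hypothetical helper, the allowed suffixes must come from each crawler operator's own documentation, and the `reverse`/`forward` parameters exist only so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, List, Optional


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Optional[Callable[[str], str]] = None,
    forward: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # The PTR record must sit under a domain the operator documents.
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False

    # PTR records are controlled by whoever owns the IP block, so confirm
    # that the hostname resolves back to the same address.
    try:
        return ip in forward(host)
    except OSError:
        return False
```

A User-Agent string alone proves nothing, so this check is what separates a crawler that is who it says it is from a bot borrowing its name.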

CIAM for AI: Authn/Authz for Agents — Michael Grinich, CEO of WorkOS

Agents need a new identity class — neither human nor service-account — with patterns like shadow personas, delegation chains, and capability tokens to keep them safe in enterprise systems.

1.9K views · Jul 21, 2025

AI + Security & Safety — Don Bosco Durai

Single-process agent frameworks violate zero-trust — credentials, prompts, and tool outputs need isolation boundaries to prevent injection-based privilege escalation.

1.4K views · Apr 19, 2025

LLM Safeguards: Security Privacy Compliance Anti Hallucination: Daniel Whitenack

Production LLM safety requires a layered checklist covering hallucination, supply chain, server resilience, PII leakage, and prompt injection — not a single guardrail.

1.3K views · Dec 31, 2024

Securing Agents with Open Standards — Bobby Tiernay and Kam Sween, Auth0

Use open identity standards (OAuth 2.1, token exchange, CIBA, FGA at retrieval) to keep agent actions tied to real users with scoped short-lived tokens, not env-var keys.

1.1K views · Jun 30, 2025

AI Frontiers in Trust and Safety Combatting Multifaceted Harm on Tinder at Scale: Vibhor Kumar

Fine-tuning open-source LLMs on hybrid LLM-mined + human-verified data is the practical playbook for trust-and-safety classification at consumer scale.

1.1K views · Dec 02, 2024

AI Red Teaming Agent: Azure AI Foundry — Nagkumar Arkalgud & Keiji Kanazawa, Microsoft

Azure AI Foundry packages Microsoft's PyRIT red-teaming toolkit as a managed SDK + dashboard so AI engineers can red-team their own apps without standing up the framework themselves.

802 views · Jun 27, 2025

Critical AI Inference your CIO can Trust — Sahil Yadav, Hariharan Ganesan, Telemetrak

For CIOs to trust AI in critical systems, MLOps needs to evolve into XTOps with explainability, traceability and guardrails plus business-facing metrics like MTRE and trust-adjusted risk.

680 views · Jul 22, 2025

Building security around ML: Dr. Andrew Davis

ML security has moved from anti-malware-style ML to defending the ML pipeline itself — poisoning, theft, and adversarial examples remain open problems.

434 views · Feb 08, 2025

Cognitive Shield Real Time Real Smart - Rachna Srivastava

AI-powered fraud requires AI-powered defenses—layered detection (data hygiene + multi-modal real-time + human escalation) rebuilds trust against deepfakes and synthetic identities.

305 views · Jun 03, 2025