- What outcome are we optimizing for? → Legitimate-user auth success rate (authorized users can log in, every time, fast) WHILE maintaining zero unauthorized access. These are competing: tighter security → more friction → more legitimate failures. Secondary: time-to-productivity for new employees (onboarding speed) and time-to-lockout for departing employees (deprovisioning speed). This shapes the policy engine: it must be expressive enough to distinguish "unusual but legitimate" from "compromised account" without blocking real users.
- Core capabilities? → Single Sign-On (SSO), Multi-Factor Authentication (MFA), Universal Directory (user/group management), Lifecycle Management (provisioning/deprovisioning), API Access Management (OAuth2/OIDC tokens).
- Who are the tenants? → ~19,000+ customer organizations, from 50-person startups to Fortune 500 enterprises. Each tenant is completely isolated. Tenant data MUST NEVER cross boundaries.
- Federation protocols? → SAML 2.0 (still dominant for enterprise), OpenID Connect/OAuth 2.0 (modern apps), WS-Federation (Microsoft legacy), SCIM (user provisioning).
- Integration catalog? → Okta Integration Network (OIN): 7,500+ pre-built integrations (Salesforce, Slack, AWS, Office 365, etc.) with SSO and provisioning.
- On-prem connectivity? → Okta AD Agent: lightweight agent installed in customer's network that bridges on-prem Active Directory / LDAP to cloud Okta. Outbound-only connections (no inbound firewall holes).
- Scale? → Hundreds of millions of authentications per day across all tenants. Sub-200ms auth latency. 99.99% availability SLA.
| In Scope | Out of Scope |
|---|---|
| Authentication pipeline (login flow) | Okta Privileged Access (PAM) |
| SSO (SAML + OIDC) federation | Identity Governance & Admin (IGA) |
| Multi-factor authentication (MFA) | Advanced Threat Detection analytics |
| Multi-tenant data isolation | Customer Identity (CIAM) specifics |
| Universal Directory & user store | Passwordless / FIDO2 protocol details |
| Directory sync (AD/LDAP agents) | Admin console UI |
| Token issuance (OAuth2/OIDC) | Pricing / billing system |
- UC1 (Employee SSO Login): Employee at Acme Corp opens Salesforce → redirected to Okta (acme.okta.com) → enters username/password → Okta verifies credentials → MFA challenge (push to Okta Verify app) → user approves → Okta issues SAML assertion → Salesforce grants access. Total: <5 seconds including MFA.
- UC2 (Session-based SSO): Same employee, 10 minutes later, opens Slack → redirected to Okta → Okta recognizes EXISTING session (cookie) → no password prompt → issues SAML assertion for Slack → Slack grants access. Total: <1 second (transparent redirect).
- UC3 (New Employee Onboarding): HR creates user in Workday → SCIM provisioning triggers in Okta → Okta creates user, assigns to groups based on department → auto-provisions accounts in Salesforce, Slack, GitHub, AWS (each via SCIM or API) → employee has access to all apps from day one.
- UC4 (Employee Offboarding): HR deactivates user in Workday → Okta deactivates user → IMMEDIATELY revokes all active sessions → deprovisions from all connected apps → within minutes, employee is locked out of EVERYTHING. This is the most security-critical lifecycle event.
- UC5 (Suspicious Login): Login attempt from unusual location (user normally in NYC, attempt from Eastern Europe) → Okta's adaptive MFA policy triggers additional factor → user can't produce the factor → access denied → security event logged → admin alerted.
- Availability: 99.99% (about 52.6 min of downtime per year at most): If Okta is down, employees at 19,000+ companies can't log in to ANY of their applications. Okta downtime = enterprise-wide work stoppage for customers.
- Latency: <200ms for auth decisions: Authentication sits on the critical path of every login. A slow Okta makes every SaaS app feel slow. Users blame the app, not the identity provider.
- Security: zero data leakage between tenants: Acme Corp's user directory must NEVER be visible to Beta Corp. This is contractual, regulatory, and existential: a cross-tenant leak would destroy trust in the platform.
- Auditability: Every authentication, MFA challenge, token issuance, provisioning action, admin change logged with immutable audit trail per tenant. Required for SOC 2, FedRAMP, HIPAA compliance.
- Consistency: Strong consistency for auth decisions. If a user is deactivated at T=0, no authentication should succeed at T=0+1ms. Stale reads here = security vulnerability.
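The availability figure above is simple arithmetic, and a quick check makes the budget concrete. This is a minimal sketch; the function name `downtime_budget_minutes` is illustrative and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Maximum downtime per year, in minutes, for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/year")
```

At 99.99% the budget works out to roughly 52.6 minutes per year, so a single multi-hour regional incident would exhaust it many times over.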
| Requirement | Decision | Why (and what was rejected) | Consistency |
|---|---|---|---|
| 99.99% availability (downtime = enterprise work stoppage) | Multi-region active-active | Each region independently serves auth requests. If one region fails, others absorb traffic. Rejected: single-region, where one regional outage could consume the entire ~52.6 min/year downtime budget. | AP for routing, CP for auth decisions |
| Auth decisions must be consistent (deprovisioned = blocked) | Strong consistency for sessions + auth | A deprovisioned user must be blocked within 30 seconds globally. Eventual consistency could allow fired employees to access systems. | CP |
| 19,000 tenants, strict data isolation | Shared DB sharded by org_id (not DB-per-tenant) | 19K databases = 19K to patch, back up, monitor. Shared DB + mandatory org_id filter + Row-Level Security. Tradeoff: isolation depends on app correctness. | N/A |
| <200ms auth decision latency (excluding the time a user spends completing MFA) | Redis for session tokens (not DB lookup) | Session validation is the hottest path. Redis <1ms vs PostgreSQL 5-20ms: roughly an order-of-magnitude latency reduction on every authenticated request. | N/A |
| SAML/OIDC tokens must be cryptographically signed | Per-tenant signing keys in HSM | Hardware Security Module: private keys never leave the HSM. Even Okta engineers can't extract them. Required for enterprise trust. | N/A |
| Federated with 200+ SaaS apps | Standard protocols (SAML, OIDC, SCIM) not custom APIs | Standards mean any compliant SP works without custom integration. Custom APIs would require per-SP adapters, which doesn't scale to 200+ apps. | N/A |
Authentication Pipeline (HOT PATH)
- Credential verification (password, delegated auth)
- MFA orchestration (push, TOTP, SMS, WebAuthn)
- Adaptive risk scoring (location, device, behavior)
- Session creation and management
Universal Directory (STORE)
- User profiles, credentials, group memberships
- Per-tenant isolated data store
- Custom attributes / schema per tenant
- Mastered by: Okta, AD, LDAP, HR system, or SCIM
Policy Engine (RULES)
- Sign-on policies (which factors, when)
- App-level access policies
- Network zones (trusted IPs / VPN)
- Adaptive MFA rules (risk-based)
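To make the policy engine's inputs concrete, here is a minimal first-match rule evaluator. It is a sketch under stated assumptions, not Okta's policy model: the `Rule` and `LoginContext` types, the 0.0-1.0 risk scale, the thresholds, and the deny-by-default fallback are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class LoginContext:
    ip: str
    risk_score: float   # hypothetical scale: 0.0 (benign) to 1.0 (high risk)
    device_known: bool


@dataclass(frozen=True)
class Rule:
    name: str
    action: str                        # "ALLOW", "REQUIRE_MFA", or "DENY"
    max_risk: float = 1.0              # rule applies only if risk_score <= max_risk
    zones: tuple[str, ...] = ()        # CIDR blocks; empty means any network
    require_known_device: bool = False

    def matches(self, ctx: LoginContext) -> bool:
        if ctx.risk_score > self.max_risk:
            return False
        if self.require_known_device and not ctx.device_known:
            return False
        if self.zones:
            addr = ipaddress.ip_address(ctx.ip)
            return any(addr in ipaddress.ip_network(zone) for zone in self.zones)
        return True


def evaluate(rules: list[Rule], ctx: LoginContext) -> str:
    """Return the action of the first matching rule; deny if nothing matches."""
    for rule in rules:
        if rule.matches(ctx):
            return rule.action
    return "DENY"


# Example policy: trusted office network skips MFA, moderate risk requires it.
POLICY = [
    Rule("trusted-office", "ALLOW", max_risk=0.3,
         zones=("10.0.0.0/8",), require_known_device=True),
    Rule("normal-risk", "REQUIRE_MFA", max_risk=0.7),
]
```

Ordering rules from most to least specific keeps evaluation predictable: a low-risk login from a known device inside the trusted zone is allowed, anything else up to the risk ceiling is challenged, and the rest falls through to deny.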
Token / Assertion Service (CRYPTO)
- SAML 2.0 assertion generation & signing
- OAuth2 / OIDC token issuance (access, ID, refresh)
- Per-tenant signing keys (RSA/EC key pairs)
- Token introspection & revocation
Provisioning Engine (LIFECYCLE)
- User create/update/deactivate across apps
- SCIM 2.0 protocol for app provisioning
- Group push (Okta group → app role)
- Import from HR systems (Workday, BambooHR)
Agent Gateway (HYBRID)
- Connects on-prem AD/LDAP to cloud Okta
- Outbound-only (no inbound firewall rules)
- Delegated auth: Okta forwards to AD via agent
- Near-real-time sync: AD changes → Okta within minutes
Audit & System Log (AUDIT)
- Every auth event, admin action, provisioning event
- Immutable append-only log per tenant
- Streaming to customer SIEM (Splunk, etc.)
- SOC 2, FedRAMP, HIPAA compliance
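One common way to make an append-only log tamper-evident is to chain each entry to the hash of the previous one. The sketch below illustrates that idea only; the source does not say how the audit log achieves immutability, and the `AuditLog` class and event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    """Per-tenant append-only log where each entry commits to its predecessor."""

    def __init__(self, org_id: str):
        self.org_id = org_id
        self._entries: list[dict] = []

    def append(self, event: dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        entry_hash = _digest(prev_hash, event)
        self._entries.append({"event": dict(event), "prev": prev_hash, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev_hash or _digest(prev_hash, entry["event"]) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True
```

Hash chaining detects modification after the fact; it does not by itself prevent deletion of the whole log, which is why write-once storage or external anchoring is usually layered on top.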
Threat Detection (GUARD)
- Credential stuffing detection
- Bot detection (CAPTCHAs, device fingerprint)
- Impossible travel detection
- Brute-force rate limiting per user & IP
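Impossible travel detection, listed above, reduces to a speed check between consecutive logins. The following is a minimal sketch using the haversine great-circle distance; the 900 km/h ceiling, the 50 km geo-IP tolerance, and the `Login` type are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_SPEED_KMH = 900.0  # assumption: roughly airliner cruise speed
MIN_DISTANCE_KM = 50.0           # assumption: ignore small geo-IP inaccuracies


@dataclass(frozen=True)
class Login:
    lat: float
    lon: float
    timestamp: float  # Unix seconds


def haversine_km(a: Login, b: Login) -> float:
    """Great-circle distance between two login locations, in kilometers."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlambda = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def is_impossible_travel(prev: Login, cur: Login) -> bool:
    """True if moving from prev to cur would need an implausible speed."""
    distance = haversine_km(prev, cur)
    if distance < MIN_DISTANCE_KM:
        return False
    hours = (cur.timestamp - prev.timestamp) / 3600
    if hours <= 0:
        return True  # far apart at the same instant (or out of order)
    return distance / hours > MAX_PLAUSIBLE_SPEED_KMH
```

In practice a positive result would more likely raise the risk score that feeds adaptive MFA than block the login outright, since VPNs and mobile carriers produce legitimate location jumps.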
Credential Verification Modes
| Mode | How It Works | Latency | When Used |
|---|---|---|---|
| Okta-mastered | User's password hash stored in Okta's directory. Bcrypt comparison on login. | 50-150ms (bcrypt is deliberately slow to resist brute-force) | Cloud-native orgs with no on-prem AD |
| Delegated to AD | Okta forwards credentials to on-prem AD Agent → agent authenticates against AD → returns result. | 200-500ms (network round-trip to customer's data center) | Enterprises with existing AD infrastructure |
| Delegated to LDAP | Similar to AD but via LDAP bind. | 200-500ms | Non-Microsoft directory environments |
| Social / External IdP | Redirect to Google/Azure AD/etc. They verify credentials. Okta receives assertion. | 1-3 seconds (browser redirects) | Customer identity (CIAM), B2B federation |
| Passwordless | Email magic link, Okta Verify biometric, or FIDO2 WebAuthn. No password involved. | Varies (user interaction) | Modern security-conscious orgs |
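For the Okta-mastered mode, verification is a salted slow-hash comparison. The sketch below uses scrypt from Python's standard library as a stand-in for bcrypt so that it runs without third-party packages; the parameters and function names are assumptions, not the platform's actual configuration.

```python
import hashlib
import hmac
import os

# Assumed scrypt work parameters for illustration; tune to the latency budget.
SCRYPT_N, SCRYPT_R, SCRYPT_P = 2**14, 8, 1


def _derive(password: str, salt: bytes) -> bytes:
    return hashlib.scrypt(password.encode("utf-8"), salt=salt,
                          n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P)


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); only these are stored, never the password."""
    salt = os.urandom(16)  # unique per user, defeats precomputed tables
    return salt, _derive(password, salt)


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the hash and compare in constant time."""
    return hmac.compare_digest(_derive(password, salt), expected)
```

The deliberate cost of each derivation is what produces the latency range shown for this mode, and it is also what makes offline guessing against a stolen hash expensive.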
MFA Orchestration
Session Management
- Okta session: After successful authentication, Okta creates a session identified by a secure cookie (sid) on the acme.okta.com domain. Subsequent SSO redirects check this cookie: if a valid session exists, no re-authentication is needed.
- Session lifetime: Configured per tenant policy: "Session expires after 12 hours" or "Session expires after 30 minutes of inactivity." Admins balance security (short sessions) vs. UX (fewer re-auth prompts).
- Session storage: Distributed cache (Redis cluster) for <5ms lookup. Session data: user_id, tenant_id, auth_time, factors_used, device_info. Must be strongly consistent: session revocation must propagate immediately.
- Global session revocation: When user is deactivated → delete ALL session records for that user → ANY subsequent SSO redirect finds no session → re-auth required → credentials invalid → access denied. This must happen in milliseconds, not minutes.
- Tenant provisioning: New customer signs up → create tenant_id, insert into config table, generate signing key pair, create admin user. Total time: <30 seconds. No database creation needed.
- Tenant deletion (rare): Cascade-delete all rows with matching tenant_id across all tables. Revoke all signing keys. Purge from all caches. This is a dangerous operation: double-confirmed, audit-logged, and irreversible.
- Compliance isolation: Some tenants (government, healthcare) require data residency guarantees. These tenants are assigned to region-specific database shards (e.g., US-only, EU-only). The routing layer ensures their data never leaves the designated region.
| Protocol | Token Format | When Used | Key Security Property |
|---|---|---|---|
| SAML 2.0 | XML, RSA-signed | Enterprise SaaS (Salesforce, Workday, ServiceNow) | Assertion signed with tenant's private key. SP verifies with Okta's public cert. Replay protection via NotOnOrAfter + InResponseTo. |
| OIDC | JWT (JSON), RS256-signed | Modern apps, SPAs, mobile apps | ID token is a signed JWT. Access token for API access. Refresh token for long-lived sessions. PKCE for public clients. |
| OAuth 2.0 | Opaque or JWT access token | API authorization (machine-to-machine) | Scoped access (read:users vs. write:users). Short-lived access tokens (1 hour). Client credentials for service accounts. |
| WS-Fed | XML / SAML-like | Microsoft ecosystem (older O365, ADFS) | Being deprecated in favor of OIDC. Still needed for legacy Windows environments. |
| SWA | N/A (form fill) | Legacy apps with no federation | Okta browser plugin auto-fills username/password. Least secure but only option for non-federated apps. |
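The OIDC row mentions PKCE for public clients. As a small worked example, the S256 challenge defined in RFC 7636 is the unpadded base64url encoding of the SHA-256 hash of the code verifier. The helper names below are illustrative.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Client side: 32 random bytes become a 43-character URL-safe verifier."""
    return secrets.token_urlsafe(32)


def code_challenge_s256(code_verifier: str) -> str:
    """BASE64URL(SHA256(verifier)) without '=' padding, per RFC 7636."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(code_verifier: str, stored_challenge: str) -> bool:
    """Server side: at token exchange, re-derive the challenge and compare."""
    return secrets.compare_digest(code_challenge_s256(code_verifier), stored_challenge)
```

The client sends only the challenge with the authorization request and reveals the verifier at token exchange, so an intercepted authorization code is useless without the verifier.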
| Data | Store | Why This Store |
|---|---|---|
| Identity records | PostgreSQL (sharded) | User profiles, credentials, group memberships. Sharded by org_id for tenant isolation. Encrypted at rest. |
| Policy rules | PostgreSQL | Authentication policies, MFA rules, conditional access. Read-heavy, so cached aggressively. Versioned for audit. |
| Session tokens | Redis (clustered) | Active SSO sessions. TTL-based expiration. Distributed across regions for low-latency validation. |
| Audit logs | Append-only store | Every authentication event, admin action, policy change. Immutable. 90-day online retention, 7-year archive. |
| SAML/OIDC metadata | PostgreSQL + cache | SP certificates, redirect URIs, signing keys. Cached per-org. Invalidated on admin changes. |
| Directory sync state | PostgreSQL | Last sync cursor per AD/LDAP connection. Conflict resolution for bidirectional sync. |
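The session-token row combines two behaviors: TTL-based expiry and the ability to delete every session for a user at once. The sketch below models both with an in-memory dictionary standing in for Redis; the `SessionStore` class and its secondary per-user index are assumptions for illustration, not the production design.

```python
import secrets
import time
from collections import defaultdict


class SessionStore:
    """In-memory stand-in for a Redis-backed session store (illustrative only)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._sessions: dict[str, dict] = {}                      # sid -> record
        self._by_user: dict[tuple[str, str], set[str]] = defaultdict(set)

    def create(self, org_id: str, user_id: str, ttl_seconds: int) -> str:
        sid = secrets.token_urlsafe(32)
        self._sessions[sid] = {
            "org_id": org_id,
            "user_id": user_id,
            "expires_at": self._clock() + ttl_seconds,
        }
        self._by_user[(org_id, user_id)].add(sid)  # index used for revocation
        return sid

    def validate(self, org_id: str, sid: str):
        """Return the session record, or None if missing, expired, or cross-tenant."""
        record = self._sessions.get(sid)
        if record is None or record["org_id"] != org_id:
            return None
        if self._clock() >= record["expires_at"]:
            self._sessions.pop(sid, None)
            self._by_user[(org_id, record["user_id"])].discard(sid)
            return None
        return record

    def revoke_all(self, org_id: str, user_id: str) -> int:
        """Delete every session for a user; returns how many were removed."""
        sids = self._by_user.pop((org_id, user_id), set())
        for sid in sids:
            self._sessions.pop(sid, None)
        return len(sids)
```

Keeping a per-user index means global revocation does not have to scan all sessions, which matters when deactivation must take effect quickly across a large cache.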
- Defense in depth: Every layer independently enforces tenant isolation. Even if the application layer has a bug, database RLS prevents cross-tenant reads. Even if the database is compromised, per-tenant encryption keys limit blast radius.
- HSM for signing keys: Tenant signing keys stored in FIPS 140-2 Level 3 Hardware Security Modules. Private keys never leave the HSM: signing operations are performed INSIDE the HSM. Even Okta engineers cannot extract key material.
- Password storage: Bcrypt with cost factor 12 (or scrypt/Argon2). Never stored in plaintext, never logged, never transmitted except to the hashing function. Database compromise reveals only hashes, and every single guess against a hash costs an attacker ~100ms of compute.
- Zero standing access: Okta engineers don't have permanent access to production systems or customer data. Access is granted just-in-time via approval workflow, time-limited, and fully audit-logged. Access to customer data requires customer consent.
- SOC 2 Type II, FedRAMP, ISO 27001: Regular third-party audits of security controls. Audit reports available to customers. Certifications are a contractual requirement for enterprise sales.
- Per-tenant metrics: Auth success rate, auth latency (p50/p95/p99), MFA challenge rate, factor success rate. A sudden drop in auth success rate for one tenant = possible attack or configuration error.
- Global metrics: Total auth/sec, session count, token issuance rate, agent connectivity status. These drive capacity planning and incident detection.
- System Log (per tenant): Every event: login success/failure, MFA challenge, token issuance, admin change, provisioning action. Structured JSON, queryable via API, streamable to customer's SIEM. Retention: 90 days in Okta, longer via SIEM integration.
- Status page: Public status page (status.okta.com) with per-service health. Enterprise customers subscribe to real-time incident notifications. Transparency in outages builds trust; hiding issues destroys it.
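The per-tenant metrics above can be computed from raw samples in a few lines. This is a sketch only: the function names and the 5-percentage-point alert threshold are assumptions, and a real pipeline would use streaming aggregation instead of in-memory lists.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 from a window of latency samples (needs at least 2 samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def success_rate(successes: int, attempts: int) -> float:
    """Fraction of successful authentications; 1.0 when there was no traffic."""
    return successes / attempts if attempts else 1.0


def should_alert(baseline_rate: float, current_rate: float, max_drop: float = 0.05) -> bool:
    """Flag a tenant whose success rate fell more than max_drop below baseline."""
    return baseline_rate - current_rate > max_drop
```

Comparing each tenant against its own baseline, not a global threshold, is what lets a sudden drop for one tenant stand out as a possible attack or misconfiguration.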
| Extension | Architecture Impact |
|---|---|
| Continuous Access Evaluation | Instead of "authenticate once, trust for 8 hours," continuously re-evaluate risk signals during a session: device posture changes, impossible travel detected, IP reputation drops. Requires a streaming risk-scoring pipeline that consumes signals from endpoints, network, and cloud apps, and that can revoke sessions mid-flight. |
| Passkey / FIDO2 Everywhere | Eliminate passwords entirely. Users authenticate with biometrics (Face ID, Touch ID) or security keys. Requires: WebAuthn credential storage per-user-per-device, cross-device sync (Apple/Google passkey sync), and a migration path from password-based to passwordless for existing tenants. |
| Identity Threat Detection (ITDR) | ML-based detection of compromised identities: token replay, session hijacking, lateral movement. Requires: behavioral baselining per-user (normal login times, locations, apps), anomaly detection model, real-time alerting, and automated response (lock account, terminate sessions, require re-auth). |
| Decentralized Identity (Verifiable Credentials) | Users hold cryptographic credentials in a digital wallet (W3C Verifiable Credentials). Present proof of identity without revealing all attributes ("prove I'm over 21 without revealing my birthdate"). Requires: new trust model (issuer → holder → verifier), wallet integration, and selective disclosure crypto. |
| Fine-Grained Authorization (Zanzibar-style) | Move beyond "can this user access this app?" to "can this user edit this specific document?" Relationship-based access control at the resource level, like Google's Zanzibar. Requires: a policy evaluation engine that can handle millions of authorization checks per second with sub-10ms latency, plus a graph database for relationship storage. |
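To illustrate the relationship-based model in the last row, here is a minimal sketch of a Zanzibar-style check over relation tuples. It is an in-memory toy, not a graph database, and the `RelationStore` class and the `object#relation` userset notation are simplified assumptions.

```python
from collections import defaultdict


class RelationStore:
    """Stores tuples (object, relation, subject) and answers access checks."""

    def __init__(self):
        # (object, relation) -> subjects. A subject is either a user ID or a
        # userset written "object#relation", e.g. "group:eng#member".
        self._tuples = defaultdict(set)

    def write(self, obj: str, relation: str, subject: str) -> None:
        self._tuples[(obj, relation)].add(subject)

    def check(self, obj: str, relation: str, user: str, _seen=None) -> bool:
        """True if user holds relation on obj, directly or via nested usersets."""
        seen = _seen if _seen is not None else set()
        if (obj, relation) in seen:  # guard against cyclic group membership
            return False
        seen.add((obj, relation))
        for subject in self._tuples.get((obj, relation), ()):
            if subject == user:
                return True
            if "#" in subject:
                sub_obj, sub_rel = subject.split("#", 1)
                if self.check(sub_obj, sub_rel, user, seen):
                    return True
        return False
```

For example, writing `("doc:roadmap", "editor", "group:eng#member")` and `("group:eng", "member", "user:alice")` lets `check("doc:roadmap", "editor", "user:alice")` succeed through the group. The recursive walk is the part that has to be heavily cached and indexed to reach the latency targets mentioned in the table.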
What happens if Okta goes down? How do you achieve 99.99% availability?
If Okta goes down, employees at 19,000+ companies can't log in to any of their cloud applications: it's an enterprise-wide work stoppage. 99.99% means <53 minutes of downtime per year. We achieve this through: (1) multi-region active-active deployment, where each region can independently serve auth requests, (2) session persistence: if the auth service is briefly unavailable, users with existing sessions continue working (sessions are validated against Redis, not the auth DB), (3) SP-side key caching: SPs cache Okta's public signing keys and certificates, so they can validate already-issued SAML/OIDC tokens even during brief Okta outages. The most critical failover scenario is database failure: we use synchronous replication to a standby with automatic failover (<30 seconds). During those 30 seconds, new logins fail but existing sessions work. For planned maintenance, we do rolling deploys with zero-downtime migrations. The honest answer: achieving 99.99% is as much about operational discipline (change management, runbooks, chaos testing) as it is about architecture.
How do you prevent a compromised password at one SP from affecting others?
This is a core benefit of SSO: the user's password is ONLY stored and verified by Okta, never by individual SPs. When a user logs into Salesforce via Okta SSO, Salesforce never sees the password. It receives a signed SAML assertion saying "Okta verified this user at this time." If Salesforce is compromised, the attacker gets Salesforce data but NOT the user's Okta password. They can't use the breach to access Slack, AWS, or any other SP. Conversely, if the user's Okta password is compromised, the blast radius IS every SP, which is why Okta enforces MFA, credential stuffing detection, and password policies at the IdP layer. The security model is: centralize the high-value target (credentials) in a purpose-built, hardened system rather than distributing it across dozens of SPs with varying security postures.
How does user deprovisioning work across 200+ connected applications?
When HR deactivates an employee in the source directory (Workday, AD), Okta's provisioning engine must propagate that change to every connected application. The pipeline is: (1) directory sync detects the change (via SCIM webhook from Workday, or polling AD every 5 minutes), (2) Okta marks the user as "deactivated" in the Universal Directory, (3) ALL active sessions for that user are immediately revoked (delete from Redis), (4) for each connected SP that supports SCIM, Okta sends a SCIM PATCH to deactivate the user account, (5) for SPs without SCIM (legacy apps), Okta can disable via API or flag for manual review. Steps 4-5 are asynchronous and may take 5-60 seconds per SP. The SLA we guarantee: session revocation (step 3) within 30 seconds, SCIM deprovisioning within 5 minutes. The risk window is the time between deactivation and SCIM propagation: during this window, the user can't START new sessions (Okta blocks auth) but EXISTING sessions at SPs might still be valid until they expire or the SP processes the SCIM request.
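Steps 2 through 5 of that pipeline can be sketched as a single function that produces the work to be done. Everything here is illustrative: plain dictionaries stand in for the directory and session store, the `/Users/{id}` path assumes the SP uses the same user ID (real SPs have their own SCIM resource IDs), and requests are collected as tuples instead of being sent. The PATCH body follows the SCIM 2.0 PatchOp message format.

```python
import json

SCIM_PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def scim_deactivate_body() -> str:
    """SCIM 2.0 PatchOp body that sets the user's 'active' attribute to false."""
    return json.dumps({
        "schemas": [SCIM_PATCH_SCHEMA],
        "Operations": [{"op": "replace", "path": "active", "value": False}],
    })


def deprovision(user_id: str, directory: dict, sessions: dict, apps: list[dict]) -> dict:
    """directory: user_id -> status; sessions: sid -> user_id;
    apps: dicts with 'name' and 'supports_scim' (all hypothetical shapes)."""
    # Step 2: mark the user deactivated so new authentications are refused.
    directory[user_id] = "DEACTIVATED"

    # Step 3: revoke every active session synchronously, before any fan-out.
    revoked = [sid for sid, uid in sessions.items() if uid == user_id]
    for sid in revoked:
        del sessions[sid]

    # Steps 4-5: build the asynchronous fan-out; SCIM where supported,
    # manual review otherwise.
    scim_requests, manual_review = [], []
    for app in apps:
        if app["supports_scim"]:
            scim_requests.append((app["name"], "PATCH", f"/Users/{user_id}", scim_deactivate_body()))
        else:
            manual_review.append(app["name"])

    return {"revoked_sessions": revoked, "scim_requests": scim_requests, "manual_review": manual_review}
```

Doing the directory update and session revocation first, and only then queuing the per-SP calls, reflects the ordering in the text: the fast, security-critical steps should not wait on slow downstream applications.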
How do you handle multi-tenant isolation in a shared infrastructure?
Every query includes org_id as a mandatory filter, so no ordinary code path should be able to return data across organizations. This is enforced at multiple layers: (1) database: sharded by org_id, with Row-Level Security as a backstop, so a query scoped to one org cannot read another org's rows, (2) application: middleware injects org_id from the authenticated session into every DB query, so developers can't forget it, (3) API: every API endpoint validates that the requested resource belongs to the caller's org, (4) cache: Redis keys are prefixed with org_id, so one tenant's cache entries cannot be served to another tenant. For noisy neighbor protection: rate limits are per-org, so one organization running a bulk user import can't degrade auth latency for other orgs. CPU-intensive operations (bcrypt hashing) run in isolated thread pools per org. We regularly run penetration tests specifically targeting cross-tenant data leakage, because it's the highest-severity finding category.
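The application and cache layers described above can be sketched with a data-access object that is bound to one org_id when it is created, so callers never pass the tenant filter themselves. This is a minimal illustration using SQLite in memory; the `OrgScopedUsers` class, table layout, and key format are assumptions.

```python
import sqlite3


class OrgScopedUsers:
    """Data-access wrapper bound to a single org_id at construction time."""

    def __init__(self, conn: sqlite3.Connection, org_id: str):
        self._conn = conn
        self._org_id = org_id  # taken from the authenticated session, not from user input

    def find_by_email(self, email: str):
        # The org_id predicate is added here, so no caller can omit it.
        return self._conn.execute(
            "SELECT id, email FROM users WHERE org_id = ? AND email = ?",
            (self._org_id, email),
        ).fetchone()

    def cache_key(self, kind: str, ident: str) -> str:
        # Tenant-prefixed key keeps cache entries in separate namespaces.
        return f"{self._org_id}:{kind}:{ident}"


# Demo data: two tenants sharing one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, org_id TEXT NOT NULL, email TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (org_id, email) VALUES (?, ?)",
    [("acme", "alice@example.com"), ("beta", "bob@example.com")],
)
```

With this shape, `OrgScopedUsers(conn, "acme").find_by_email("bob@example.com")` returns nothing even though the row exists, because the query is always constrained to the bound tenant. Database-level Row-Level Security would then act as a second line of defense if a code path bypassed the wrapper.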