Document Anonymization for AI

Use any public LLM safely. We detect and replace sensitive data in your documents before they reach OpenAI, Claude, or Gemini. Names, IDs, account numbers — masked deterministically, restored seamlessly in the response.

See LLM Service
Original Document
Awaiting document
Sent to LLM
Waiting for anonymization...
Status
--
PII masked
--
PII leaked
--
Latency

What's included

PII Detection Engine

Named entity recognition (NER) combined with custom regex patterns. Detects names, Emirates IDs, passport numbers, IBANs, TRNs, addresses, phone numbers, and emails in English and Arabic.

Deterministic Tokenization

Same value always maps to the same token within a session. The LLM sees consistent references, so reasoning across the document works correctly without exposing real data.

LLM Gateway

Drop-in proxy for OpenAI, Anthropic, Google, Cohere, Mistral, or your own self-hosted models. Your code calls our gateway; we handle masking, routing, and re-identification.

Response Re-Identification

Tokens in the LLM's response are mapped back to original values before delivery to your application. The user sees real names; the LLM only ever saw tokens.

Multi-Format Support

Works with plain text, PDFs, DOCX, Excel, JSON, and API payloads. Connects to your existing document pipeline with built-in OCR for scanned files.

Audit Trail

Full log of what was masked, when, by whom, and which LLM was called. Required for bank compliance, GDPR records of processing, and internal data governance reviews.

How it works

Four-stage pipeline that sits between your application and the LLM.

1. Detect

Document is scanned for PII using NER models (spaCy, Presidio) and regex patterns tuned for UAE / GCC identifiers (Emirates ID, TRN, UAE phone formats, IBAN ranges).

2. Mask

Each detected entity is replaced with a deterministic token ([PERSON_1], [EID_1], etc). Same entity gets the same token throughout the document so the LLM can still reason about relationships.

3. Process

Masked content is sent to your chosen LLM (OpenAI, Claude, Gemini, or local). Only tokens cross the boundary — no real PII ever leaves your perimeter.

4. Restore

Response is scanned for tokens, which are mapped back to original values using the in-memory mapping. Your application receives a fully restored answer.

Detected Entities

Out-of-the-box detection for common PII. Custom patterns trained on request (proprietary IDs, internal references, industry-specific identifiers).

Person NamesEmirates IDPassport NumbersIBANTRNBank AccountsPhone NumbersEmail AddressesPhysical AddressesDates of BirthCompany NamesCustom Patterns

Compatible with any LLM

Our gateway routes to any major provider or your own self-hosted model. Switch providers without changing application code.

OpenAI GPTAnthropic ClaudeGoogle GeminiMistralCohereLocal LLMs

Who needs document anonymization

Banks & Compliance Teams

Use GPT-4 or Claude for AML narrative review, SAR drafting, and transaction analysis without exposing customer identities, account numbers, or Emirates IDs to external providers.

Law Firms

Analyze contracts and case files with public LLMs while preserving attorney-client privilege. Client names, opposing parties, and matter details masked before any external API call.

Healthcare Providers

Use AI to summarize patient records, generate referral letters, or extract clinical data while staying compliant with DHA and DOH rules on PHI handling.

Accounting & Audit

Process client invoices, ledgers, and BSA reports through AI for categorization and anomaly detection — without disclosing client names, vendor relationships, or TRNs.

HR & Recruitment

Run CV screening and candidate matching through GPT or Claude with all personal identifiers stripped. Reduce bias risk and meet candidate data protection expectations.

Consulting Firms

Analyze client documents with frontier AI models while honoring strict NDAs. Project codes and company names are tokenized before any third-party model sees the content.

Technical Details

NER + regex detection, deterministic HMAC-based tokenization, and full audit logging. Deployed as managed cloud or self-hosted gateway.

NER + regex
Deterministic tokens
Any LLM provider
Auto re-identification
Multi-format
Audit logging
# Anonymization Pipeline

Detection Layer:
  NER Models:   spaCy multilingual, Presidio
  Regex:        UAE EID, TRN, IBAN, passport
  Languages:    English + Arabic
  Custom:       Per-tenant patterns

Tokenization:
  Strategy:     HMAC-SHA256 deterministic
  Format:       [ENTITY_TYPE_N]
  Consistency:  Session-scoped (same value -> same token)
  Reversibility: In-memory mapping, optional encrypted persist

Gateway:
  Providers:    OpenAI, Anthropic, Google, Cohere,
                Mistral, self-hosted (Ollama, vLLM)
  Protocol:     OpenAI-compatible API surface
  Streaming:    SSE with on-the-fly re-identification

Compliance:
  Audit:        Full request/response log (masked)
  Retention:    Configurable per tenant
  Deployment:   Managed cloud / self-hosted / hybrid
  Standards:    UAE PDPL, GDPR, ADHICS-ready

Every project is different

Pricing depends on document volume, entity types, and whether you need managed cloud or self-hosted gateway. We'll size the right setup for your compliance requirements.

Ready to use AI without leaking data?

Tell us which LLMs you want to use and what data you need protected. Free initial consultation.

See LLM Service