Document Anonymization for AI
Use any public LLM safely. We detect and replace sensitive data in your documents before they reach OpenAI, Claude, or Gemini. Names, IDs, account numbers — masked deterministically, restored seamlessly in the response.
What's included
PII Detection Engine
Named entity recognition (NER) combined with custom regex patterns. Detects names, Emirates IDs, passport numbers, IBANs, TRNs, addresses, phone numbers, and emails in English and Arabic.
Deterministic Tokenization
Same value always maps to the same token within a session. The LLM sees consistent references, so reasoning across the document works correctly without exposing real data.
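A minimal sketch of session-scoped deterministic tokenization. It assumes a per-session secret key and uses an HMAC-SHA256 digest of each value as the lookup key, so the raw value is not the dictionary key. The class name SessionTokenizer and its methods are hypothetical and chosen for illustration; the actual engine may work differently.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional


class SessionTokenizer:
    """Maps each (entity type, value) pair to a stable token for one session."""

    def __init__(self, key: Optional[bytes] = None) -> None:
        # Assumption: one random secret per session, so digests are not
        # comparable across sessions.
        self._key = key or secrets.token_bytes(32)
        self._token_by_digest: Dict[str, str] = {}
        self._value_by_token: Dict[str, str] = {}
        self._counters: Dict[str, int] = {}

    def _digest(self, entity_type: str, value: str) -> str:
        message = f"{entity_type}:{value}".encode("utf-8")
        return hmac.new(self._key, message, hashlib.sha256).hexdigest()

    def tokenize(self, entity_type: str, value: str) -> str:
        digest = self._digest(entity_type, value)
        token = self._token_by_digest.get(digest)
        if token is None:
            # First time this value is seen: allocate the next number
            # for its entity type.
            number = self._counters.get(entity_type, 0) + 1
            self._counters[entity_type] = number
            token = f"[{entity_type}_{number}]"
            self._token_by_digest[digest] = token
            self._value_by_token[token] = value
        return token

    def original(self, token: str) -> str:
        # Reverse lookup used when restoring a response.
        return self._value_by_token[token]
```

Calling tokenize twice with the same type and value returns the same token, while a new value of the same type gets the next number.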
LLM Gateway
Drop-in proxy for OpenAI, Anthropic, Google, Cohere, Mistral, or your own self-hosted models. Your code calls our gateway; we handle masking, routing, and re-identification.
Response Re-Identification
Tokens in the LLM's response are mapped back to original values before delivery to your application. The user sees real names; the LLM only ever saw tokens.
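When a response is streamed, a token such as [PERSON_1] can arrive split across two chunks. The sketch below shows one possible way to handle that, assuming the [ENTITY_TYPE_N] token format: it holds back any trailing fragment that could still grow into a token. The function restore_stream and both regular expressions are hypothetical, not the gateway's actual API.

```python
import re
from typing import Dict, Iterable, Iterator

# A complete token, e.g. "[PERSON_1]" or "[ENTITY_TYPE_3]".
TOKEN_RE = re.compile(r"\[[A-Z][A-Z_]*_\d+\]")
# A trailing fragment that might still become a token, e.g. "[PERS".
PARTIAL_RE = re.compile(r"\[[A-Z_]*\d*\Z")


def restore_stream(chunks: Iterable[str], mapping: Dict[str, str]) -> Iterator[str]:
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        partial = PARTIAL_RE.search(buffer)
        cut = partial.start() if partial else len(buffer)
        # Emit everything before the possible partial token; keep the rest.
        ready, buffer = buffer[:cut], buffer[cut:]
        if ready:
            yield TOKEN_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), ready)
    if buffer:
        # The stream ended on an incomplete fragment; emit it unchanged.
        yield buffer
```

Text that does not look like the start of a token passes straight through, so the added latency is limited to the few characters being held back.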
Multi-Format Support
Works with plain text, PDFs, DOCX, Excel, JSON, and API payloads. Connects to your existing document pipeline with built-in OCR for scanned files.
Audit Trail
Full log of what was masked, when, by whom, and which LLM was called. Required for bank compliance, GDPR records of processing, and internal data governance reviews.
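As an illustration only, an audit entry could record who called which model and how many entities of each type were masked, without storing the original values. The field names and the function build_audit_record are assumptions made for this sketch.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from typing import Dict


def build_audit_record(user_id: str, provider: str, model: str,
                       mapping: Dict[str, str]) -> str:
    # Count entities per type from token names such as "[PERSON_1]".
    # The original values held in the mapping are deliberately not logged.
    counts = Counter(token.strip("[]").rsplit("_", 1)[0] for token in mapping)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "provider": provider,
        "model": model,
        "masked_entity_counts": dict(sorted(counts.items())),
    }
    return json.dumps(record)
```

Each call returns one JSON line that can be appended to a log file or shipped to an existing log pipeline.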
How it works
Four-stage pipeline that sits between your application and the LLM.
1. Detect
The document is scanned for PII using NER tooling (spaCy, Presidio) and regex patterns tuned for UAE / GCC identifiers (Emirates ID, TRN, UAE phone formats, IBAN formats).
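The regex half of detection can be sketched as follows. The three patterns are illustrative assumptions (an Emirates ID written as 784-YYYY-NNNNNNN-N, a 23-character UAE IBAN, and a simple email shape); they perform no checksum validation and are not the production rules. The names Entity, PATTERNS, and detect are hypothetical.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    entity_type: str
    start: int
    end: int
    text: str


# Illustrative patterns only; real rules would also validate check digits.
PATTERNS = {
    "EID": re.compile(r"\b784-?\d{4}-?\d{7}-?\d\b"),
    "IBAN": re.compile(r"\bAE\d{2} ?(?:\d{4} ?){4}\d{3}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def detect(text: str) -> List[Entity]:
    """Return every pattern match, ordered by position in the text."""
    found = [
        Entity(entity_type, m.start(), m.end(), m.group(0))
        for entity_type, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    # Overlapping matches are not resolved in this sketch.
    return sorted(found, key=lambda e: e.start)
```

Names and addresses do not follow a fixed shape, which is why a pipeline like this pairs regex rules with an NER model.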
2. Mask
Each detected entity is replaced with a deterministic token ([PERSON_1], [EID_1], etc.). The same entity gets the same token throughout the document, so the LLM can still reason about relationships.
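A minimal sketch of the masking step, assuming detection has already produced non-overlapping (start, end, entity_type) spans. The function mask and the Span alias are hypothetical names used for illustration.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)


def mask(text: str, spans: List[Span]) -> Tuple[str, Dict[str, str]]:
    """Replace each span with a token; return masked text and token->value map."""
    token_by_value: Dict[Tuple[str, str], str] = {}
    counters: Dict[str, int] = {}
    mapping: Dict[str, str] = {}
    pieces: List[str] = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        value = text[start:end]
        key = (entity_type, value)
        if key not in token_by_value:
            # New value: allocate the next number for this entity type.
            counters[entity_type] = counters.get(entity_type, 0) + 1
            token = f"[{entity_type}_{counters[entity_type]}]"
            token_by_value[key] = token
            mapping[token] = value
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(token_by_value[key])
        cursor = end
    pieces.append(text[cursor:])  # untouched text after the last span
    return "".join(pieces), mapping
```

Building the output from slices in one left-to-right pass keeps the original offsets valid, which an in-place replacement would not.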
3. Process
Masked content is sent to your chosen LLM (OpenAI, Claude, Gemini, or a local model). Only the masked text crosses the boundary; detected PII stays inside your perimeter.
4. Restore
Response is scanned for tokens, which are mapped back to original values using the in-memory mapping. Your application receives a fully restored answer.
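The restore step can be sketched as a single regex pass over the response, assuming the [ENTITY_TYPE_N] token format and a token-to-value mapping kept from the masking step. The name restore is hypothetical.

```python
import re
from typing import Dict

# A complete token, e.g. "[PERSON_1]" or "[ENTITY_TYPE_3]".
TOKEN_RE = re.compile(r"\[[A-Z][A-Z_]*_\d+\]")


def restore(response: str, mapping: Dict[str, str]) -> str:
    # Tokens missing from the mapping (for example ones the model
    # invented) are left as-is rather than raising an error.
    return TOKEN_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), response)
```

Because the substitution happens in one pass, a restored value that happens to contain token-like text is not substituted a second time.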
Detected Entities
Out-of-the-box detection for common PII. Custom patterns added on request (proprietary IDs, internal references, industry-specific identifiers).
Compatible with any LLM
Our gateway routes to any major provider or your own self-hosted model. Switch providers without changing application code.
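One way such routing could work is a lookup from the requested model name to an upstream endpoint, so the application only ever changes the model string. The routing table, the prefixes, and the function resolve_upstream below are assumptions for illustration, not the gateway's actual configuration.

```python
from typing import Dict, NamedTuple


class Upstream(NamedTuple):
    provider: str
    base_url: str


# Hypothetical routing table keyed by model-name prefix.
ROUTES: Dict[str, Upstream] = {
    "gpt-": Upstream("openai", "https://api.openai.com/v1"),
    "claude-": Upstream("anthropic", "https://api.anthropic.com/v1"),
    "local/": Upstream("self-hosted", "http://localhost:8000/v1"),
}


def resolve_upstream(model: str) -> Upstream:
    """Pick the upstream for a model name, or fail loudly if none matches."""
    for prefix, upstream in ROUTES.items():
        if model.startswith(prefix):
            return upstream
    raise ValueError(f"No route configured for model {model!r}")
```

Failing on an unknown model, rather than falling back to a default provider, avoids sending content to an endpoint nobody chose.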
Who needs document anonymization
Banks & Compliance Teams
Use GPT-4 or Claude for AML narrative review, SAR drafting, and transaction analysis without exposing customer identities, account numbers, or Emirates IDs to external providers.
Law Firms
Analyze contracts and case files with public LLMs while preserving attorney-client privilege. Client names, opposing parties, and matter details masked before any external API call.
Healthcare Providers
Use AI to summarize patient records, generate referral letters, or extract clinical data while staying compliant with DHA and DOH rules on PHI handling.
Accounting & Audit
Process client invoices, ledgers, and BSA reports through AI for categorization and anomaly detection — without disclosing client names, vendor relationships, or TRNs.
HR & Recruitment
Run CV screening and candidate matching through GPT or Claude with all personal identifiers stripped. Reduce bias risk and meet candidate data protection expectations.
Consulting Firms
Analyze client documents with frontier AI models while honoring strict NDAs. Project codes and company names are tokenized before any third-party model sees the content.
Technical Details
NER + regex detection, deterministic HMAC-based tokenization, and full audit logging. Deployed as managed cloud or self-hosted gateway.
# Anonymization Pipeline
Detection Layer:
  NER Models: spaCy multilingual, Presidio
  Regex: UAE EID, TRN, IBAN, passport
  Languages: English + Arabic
  Custom: Per-tenant patterns
Tokenization:
  Strategy: HMAC-SHA256 deterministic
  Format: [ENTITY_TYPE_N]
  Consistency: Session-scoped (same value -> same token)
  Reversibility: In-memory mapping, optional encrypted persist
Gateway:
  Providers: OpenAI, Anthropic, Google, Cohere, Mistral, self-hosted (Ollama, vLLM)
  Protocol: OpenAI-compatible API surface
  Streaming: SSE with on-the-fly re-identification
Compliance:
  Audit: Full request/response log (masked)
  Retention: Configurable per tenant
  Deployment: Managed cloud / self-hosted / hybrid
  Standards: UAE PDPL, GDPR, ADHICS-ready
Every project is different
Pricing depends on document volume, entity types, and whether you need a managed cloud or a self-hosted gateway. We'll size the right setup for your compliance requirements.
Ready to use AI without leaking data?
Tell us which LLMs you want to use and what data you need protected. Free initial consultation.