How doccape Works: A Look Behind the Scenes of Our Anonymization
doccape combines rule-based methods with AI to reliably detect sensitive data in texts, images, and embedded content. Consistent pseudonyms preserve context — ideal for analytics and AI applications. The solution is customizable and can be deployed locally or on EU servers — for maximum control and privacy.

Privacy protection — but make it smart!
In the digital age, characterized by exponentially growing data volumes and technological innovation, the protection of personal data plays a decisive role. Data privacy and data security are more relevant than ever — especially against the backdrop of regulatory requirements such as the General Data Protection Regulation (GDPR) and the EU AI Act, which sets clear standards for the responsible use of artificial intelligence.
Pseudonymization and anonymization with consistent replacement occupy a central position in this landscape: personal information is altered so that individuals can no longer be directly identified, while the data remains usable. This not only enables privacy-compliant use of sensitive information, but also significantly reduces the risk of unintended data leaks or misuse. Particularly when using generative AI — for example in chatbots or fully automated processes — processing with consistent replacement is becoming increasingly important, as large volumes of sensitive data are often involved.
What makes context preservation so complex?
With anonymization, any possibility of identifying a person is permanently excluded. Detection alone is sufficient for this, because every detected passage can simply be removed or masked. Consistent-replacement processing pursues a different goal: identity is concealed, but texts remain "readable" and usable. For example, "Mr. Müller" and "Peter Müller" both become "PER-42" when they refer to the same person.
For this to work, all entities, such as persons or places, must be replaced consistently throughout the entire text, and different people sharing the same surname must not be merged. "Mr. Müller" might refer to the son in one passage and to the father in another, and this distinction must be preserved.
What is already tricky for humans is an even greater challenge for machines, especially in long texts where relevant references are far apart. The reason lies in the architecture of many language models: they can only process a limited amount of text at once, the so-called context window. Keeping track of all mentioned entities across a long document quickly becomes difficult.
On top of this, large language models tend to hallucinate. While they can generally analyze longer texts, the original content may be inadvertently altered in the process. The more entities are involved and the more complex their relationships, the more even powerful models reach their limits.
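To make the goal of consistent replacement concrete, here is a minimal Python sketch. It is not doccape's implementation: it assumes that detection and assignment have already happened and that each mention arrives with character offsets and an entity id. The names `Mention` and `pseudonymize`, the ids "son" and "father", and the numbering scheme are hypothetical; only the "PER-n" label style follows the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int      # character offset where the mention begins
    end: int        # character offset just past the mention
    entity_id: str  # hypothetical id from an earlier assignment step
    kind: str       # entity type, e.g. "PER" for a person


def pseudonymize(text: str, mentions: list[Mention]) -> str:
    """Replace every mention with a label that is stable per entity."""
    labels: dict[str, str] = {}
    counters: dict[str, int] = {}
    # Hand out labels in reading order so numbering follows first appearance.
    for m in sorted(mentions, key=lambda m: m.start):
        if m.entity_id not in labels:
            counters[m.kind] = counters.get(m.kind, 0) + 1
            labels[m.entity_id] = f"{m.kind}-{counters[m.kind]}"
    # Replace from right to left so the remaining offsets stay valid.
    result = text
    for m in sorted(mentions, key=lambda m: m.start, reverse=True):
        result = result[:m.start] + labels[m.entity_id] + result[m.end:]
    return result


text = "Peter Müller visited his father. Mr. Müller left early; old Mr. Müller stayed."
first = text.index("Mr. Müller")
second = text.index("Mr. Müller", first + 1)
mentions = [
    Mention(0, len("Peter Müller"), "son", "PER"),
    Mention(first, first + len("Mr. Müller"), "son", "PER"),
    Mention(second, second + len("Mr. Müller"), "father", "PER"),
]
print(pseudonymize(text, mentions))
# PER-1 visited his father. PER-1 left early; old PER-2 stayed.
```

The replacement itself is the easy part. The hard part is producing the `entity_id` values, that is, deciding that the second "Mr. Müller" is a different person from the first.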
How we solve these challenges in doccape
doccape combines classical rule-based methods with modern AI technologies. This allows the system to reliably detect personal and sensitive information in texts and images. By using consistent pseudonyms, relationships within the text are preserved — a crucial advantage for further analysis and AI applications. The underlying principle: data protection and usability are not mutually exclusive — quite the opposite.
The three phases of processing
Below we provide a detailed insight into the underlying technology, which ensures both precise results and compliance with regulatory requirements.
1. Extraction
In the extraction phase, raw text and image data are pulled out of a document such as a PDF file. The document is decomposed into its elements, and the exact position of each element is stored so that the document can be reconstructed correctly later.
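The following Python sketch illustrates why storing positions matters. It is a simplified illustration under stated assumptions, not doccape's data model: `TextElement` and `reconstruct` are hypothetical names, coordinates are assumed to start at the top-left corner of a page, and elements with the same `y` value are treated as one line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextElement:
    page: int   # page number the element was found on
    x: float    # horizontal position (assumed origin: top-left)
    y: float    # vertical position (assumed origin: top-left)
    text: str   # extracted text content


def reconstruct(elements: list[TextElement]) -> dict[int, str]:
    """Rebuild the text of each page in reading order from stored positions."""
    lines: dict[tuple[int, float], list[str]] = {}
    # Sort by page, then top to bottom, then left to right.
    for e in sorted(elements, key=lambda e: (e.page, e.y, e.x)):
        lines.setdefault((e.page, e.y), []).append(e.text)
    pages: dict[int, list[str]] = {}
    for (page, _y), words in lines.items():
        pages.setdefault(page, []).append(" ".join(words))
    return {page: "\n".join(page_lines) for page, page_lines in pages.items()}


# Elements may arrive in arbitrary order; positions restore the layout.
elements = [
    TextElement(1, 120.0, 50.0, "Müller"),
    TextElement(2, 72.0, 50.0, "Page two"),
    TextElement(1, 72.0, 65.0, "Berlin"),
    TextElement(1, 72.0, 50.0, "Peter"),
]
print(reconstruct(elements))
# {1: 'Peter Müller\nBerlin', 2: 'Page two'}
```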
2. Detection
The detection of personal data forms the core of the processing. Extracted data is analyzed using specialized AI models that identify relevant positions. In the image domain, this creates so-called bounding boxes that, for example, frame a detected face. In the text domain, relevant passages — such as names — are identified and prepared for later assignment to unique pseudonyms. The underlying methods are explained in more detail below.
2.1. Image recognition
In the area of computer vision, the solution is based on modern object detection technology. Since the AlexNet milestone in 2012, this field has been dominated by neural networks. Previously, distinctive structures (so-called "features", such as circles and lines in character recognition) had to be defined manually; today, AI models learn them automatically. YOLO models in particular, which have been refined continuously in recent years, deliver fast and precise results, though they can struggle with densely packed objects such as faces in a crowd. Transformer-based models currently provide the highest accuracy but demand greater computational resources. We combine these technologies to achieve high precision at acceptable processing times.

2.2. Entity detection
In addition to rule-based methods, which we use for example with white- and blacklists, we rely primarily on the advanced text understanding of modern natural language processing models. The success of these models is based on the Transformer architecture from 2017, which forms the foundation of today's AI systems such as ChatGPT. The "intelligence" of such language models emerges from training on terabytes of fill-in-the-blank texts, in which the model has to predict words that were hidden from it. In doing so, it learns which words appear in the context of others and is essentially forced, by the enormous amount of data, to develop a smarter strategy than simply memorizing all texts.
This contextual understanding enables our model to distinguish different entity types such as persons, places, or account numbers. For example, the word "Baker" could refer to a person or a shop depending on context. Unlike the generative language models behind ChatGPT, our approach uses a lean encoder-only model specialized in Named Entity Recognition (NER), which is precise, memory-efficient, and well suited to on-premise deployments. Because such a model only labels passages of the existing text and never generates new text, hallucinations that could distort the content are ruled out. We will discuss in detail why we do not use a generative language model in a separate upcoming blog post.
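The rule-based layer can be pictured with a short Python sketch. Everything in it is an assumption made for illustration: the names `Detection`, `IBAN_LIKE`, and `detect_by_rules`, the simplified account-number pattern, and the reading of the two lists (listed terms on the blacklist are always reported, terms on the whitelist never are). The NER model itself is not shown here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    start: int  # character offset where the hit begins
    end: int    # character offset just past the hit
    kind: str   # "ACCOUNT" for pattern hits, "LISTED" for blacklist hits
    text: str   # the matched passage


# Simplified pattern for IBAN-like strings: country code, two check digits,
# then 11 to 30 letters or digits. It does not validate the checksum.
IBAN_LIKE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def detect_by_rules(text: str, blacklist: set[str], whitelist: set[str]) -> list[Detection]:
    """Collect pattern and blacklist hits, then drop whitelisted passages."""
    hits = [Detection(m.start(), m.end(), "ACCOUNT", m.group())
            for m in IBAN_LIKE.finditer(text)]
    for term in blacklist:
        for m in re.finditer(rf"\b{re.escape(term)}\b", text):
            hits.append(Detection(m.start(), m.end(), "LISTED", m.group()))
    kept = [h for h in hits if h.text not in whitelist]
    return sorted(kept, key=lambda h: h.start)


text = "Pay DE89370400440532013000 for Project Falcon, not to DE02120300000000202051."
hits = detect_by_rules(
    text,
    blacklist={"Project Falcon"},            # always treated as sensitive
    whitelist={"DE02120300000000202051"},    # e.g. a publicly known account
)
print([(h.kind, h.text) for h in hits])
# [('ACCOUNT', 'DE89370400440532013000'), ('LISTED', 'Project Falcon')]
```

Rules like these are predictable and cheap, but they cannot tell whether "Baker" is a person or a shop. That is the part left to the language model.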
2.3. Assignment

The assignment of detected personal data to unique pseudonyms is also AI-based. Our model generates so-called embeddings: numerical representations that contain, from the language model's perspective, all relevant information about a given detection, e.g. "Mr. Müller". By comparing these embeddings, similarities and differences can be identified, which effectively groups the detections that belong to the same entity. In linguistics, this task is known as coreference resolution (CR): linguistic expressions referring to the same entity are linked together.
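Grouping by embedding similarity can be sketched in a few lines of Python. This is a toy illustration, not the method doccape uses: the three-dimensional vectors are invented, the 0.9 threshold is arbitrary, and the greedy strategy (compare each mention with the first member of every existing group) stands in for whatever comparison the real model performs. `cosine` and `group_mentions` are hypothetical names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_mentions(embeddings: list[list[float]], threshold: float = 0.9) -> list[list[int]]:
    """Greedily group mention indices whose embeddings are similar enough."""
    groups: list[list[int]] = []
    for i, vec in enumerate(embeddings):
        for group in groups:
            # Compare against the first member of the group as its representative.
            if cosine(vec, embeddings[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar group found: start a new entity
    return groups


# Invented embeddings for three detections in one text.
mentions = ["Peter Müller", "Mr. Müller (son)", "Mr. Müller (father)"]
embeddings = [
    [0.90, 0.10, 0.20],
    [0.85, 0.15, 0.25],
    [0.10, 0.90, 0.30],
]
print(group_mentions(embeddings))
# [[0, 1], [2]]
```

Each resulting group corresponds to one entity and therefore to one pseudonym. The two "Mr. Müller" mentions share the same surface form, yet end up in different groups because their context-dependent embeddings differ.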
3. Anonymization

In the final step, recognized and assigned personal data in texts and images is replaced by pseudonyms or fully masked — for example by pixelation or blackening. We always create a new document or image to ensure that metadata contains no sensitive information either. Hyperlinks containing sensitive information are also reliably removed.
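For the image side, the two masking styles mentioned above can be sketched in plain Python. The sketch assumes a grayscale image stored as a list of rows and a bounding box given as (left, top, right, bottom) with exclusive right and bottom edges. `blacken` and `pixelate` are hypothetical helper names, and a production system would use an imaging library instead. Both functions return a new image and leave the input untouched, in the spirit of always creating a new document.

```python
Image = list[list[int]]
Box = tuple[int, int, int, int]  # left, top, right, bottom (right/bottom exclusive)


def blacken(image: Image, box: Box) -> Image:
    """Return a copy of the image with the box region set to black (0)."""
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = 0
    return out


def pixelate(image: Image, box: Box, block: int = 2) -> Image:
    """Return a copy where each block x block tile in the box becomes its mean."""
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]
    for ty in range(y0, y1, block):
        for tx in range(x0, x1, block):
            ys = range(ty, min(ty + block, y1))
            xs = range(tx, min(tx + block, x1))
            values = [image[y][x] for y in ys for x in xs]
            mean = sum(values) // len(values)
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]
face_box = (1, 1, 3, 3)  # e.g. a bounding box from the detection phase

print(blacken(image, face_box)[1])   # [50, 0, 0, 80]
print(pixelate(image, face_box)[1])  # [50, 85, 85, 80]
```

Blackening removes the region entirely, while pixelation keeps a coarse impression of it. With very large blocks the two converge; with very small blocks pixelation may leave too much detail.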
Real use cases, real security
Our technology is already deployed across various market segments: doccape supports the automated processing of large volumes of personal documents in medicine, expert assessment, and financial services. Our approach is particularly effective when tuned to real customer datasets. By optimizing for industry-specific content, we significantly improve the detection and assignment accuracy of our product, which results in tailored solutions that adapt precisely to each use case. Our software is equally applicable in other settings, such as internal chatbot interfaces like Open WebUI or AI workflow automation, for example via our n8n module. With doccape, sensitive inputs can be processed securely, efficiently, and with full control over confidential content.
Top-level privacy — locally or in the EU
With doccape, you retain full control over your sensitive data: whether deployed locally in your own infrastructure or securely hosted on EU servers — you decide where your data is processed. Thanks to Privacy by Design, every processing operation is GDPR-compliant, traceable, and protected from unauthorized access. This combines maximum data security with modern AI — without compromise.
When privacy meets efficiency
Simply making sensitive data unusable? You can — but it's pointless if you still want to do something with the data. That's exactly why more and more organizations are turning to anonymization and pseudonymization with consistent replacement. It protects personal information while preserving semantic context. Data can continue to be analyzed, evaluated, and used — entirely without risk to privacy.
Want to see how this works in practice? 👉 Take a look at our product page!
Want to use doccape in your organization?
In a short introductory call we’ll help you assess fit, privacy and rollout options — on-premise, hybrid or in the cloud.
