Entity removal
Entity removal is a local, non-LLM preprocessing step that deletes named entities from text before downstream analysis. It is designed to reduce topic fragmentation caused by proper nouns and identifiers (for example, names, addresses, policy numbers).
This utility is optional and deterministic. It relies on a local spaCy model and does not send text to external services.
When to use it
Use entity removal when you want topic modeling or clustering to focus on intent rather than unique values. Common examples include:
Removing addresses, policy numbers, and names in customer-service transcripts.
Collapsing location-specific variants into a single intent bucket.
Cleaning entity-heavy corpora where proper nouns dominate keywords.
Installation
Entity removal uses spaCy. Install the optional dependency and the model:
pip install "biblicus[ner]"
python -m spacy download en_core_web_sm
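To confirm that both pieces are present before enabling the step, a check along these lines may help. This is a minimal sketch using only the Python standard library; the helper name `ner_dependencies_available` is hypothetical and not part of Biblicus.

```python
from importlib.util import find_spec


def ner_dependencies_available(model="en_core_web_sm"):
    """Report whether spaCy and the named model package can be located."""
    # Models installed with `python -m spacy download` are ordinary Python
    # packages, so find_spec can locate them without loading the model.
    return {
        "spacy": find_spec("spacy") is not None,
        model: find_spec(model) is not None,
    }


print(ner_dependencies_available())
```

If either value is False, rerun the corresponding install command above.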
Configuration
Entity removal is configured under topic modeling and runs after LLM extraction (if enabled) and before lexical processing.
entity_removal:
  enabled: true
  provider: spacy
  model: en_core_web_sm
  entity_types:
    - PERSON
    - GPE
    - LOC
    - ORG
    - FAC
    - DATE
    - TIME
    - MONEY
    - PERCENT
    - CARDINAL
    - ORDINAL
  replace_with: ""
  collapse_whitespace: true
  regex_patterns:
    - "\\b\\d{5}(-\\d{4})?\\b" # zip codes
    - "\\b\\d{3}[-. ]?\\d{3}[-. ]?\\d{4}\\b" # phone numbers
  regex_replace_with: ""
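To see what the two regex patterns in this configuration match, they can be tried directly with Python's `re` module. The sketch below is illustrative only: the function name `apply_regex_patterns` is hypothetical, and collapsing whitespace after the substitutions is an assumption about how `collapse_whitespace` behaves, not a description of Biblicus internals.

```python
import re

# Same patterns as in the configuration; raw strings replace the doubled
# backslashes that YAML double-quoted strings require.
REGEX_PATTERNS = [
    r"\b\d{5}(-\d{4})?\b",               # zip codes
    r"\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b",  # phone numbers
]


def apply_regex_patterns(text, patterns=REGEX_PATTERNS,
                         regex_replace_with="", collapse_whitespace=True):
    """Substitute every pattern match, then optionally tidy whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, regex_replace_with, text)
    if collapse_whitespace:
        # Assumption: runs of whitespace left behind become a single space.
        text = re.sub(r"\s+", " ", text).strip()
    return text


print(apply_regex_patterns("Call 555-123-4567 or mail to 90210-1234 today"))
# prints "Call or mail to today"
```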
Notes
When entity_types is empty, Biblicus uses a default set of common entity labels.
Regex patterns run after NER removal.
If replace_with is empty, entities are deleted.
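The behavior described in these notes can be sketched roughly as follows. This is not the Biblicus implementation: the function names are hypothetical, and `ASSUMED_DEFAULT_TYPES` is a placeholder because the actual default label set is not listed here. Entity spans are passed in as `(start, end, label)` tuples so the removal logic can be tried without spaCy; `spacy_spans` shows one way to produce them with the standard `doc.ents` attributes.

```python
import re

# Placeholder only; the real Biblicus default label set may differ.
ASSUMED_DEFAULT_TYPES = {"PERSON", "GPE", "LOC", "ORG"}


def remove_entity_spans(text, spans, entity_types=(), replace_with="",
                        collapse_whitespace=True):
    """Replace selected (start, end, label) spans in text."""
    # An empty entity_types falls back to the (assumed) default set.
    selected = set(entity_types) or ASSUMED_DEFAULT_TYPES
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        if label not in selected or start < cursor:
            continue  # skip unselected labels and overlapping spans
        pieces.append(text[cursor:start])
        pieces.append(replace_with)  # an empty string deletes the entity
        cursor = end
    pieces.append(text[cursor:])
    result = "".join(pieces)
    if collapse_whitespace:
        result = re.sub(r"\s+", " ", result).strip()
    return result


def spacy_spans(text, model="en_core_web_sm"):
    """Collect entity spans with spaCy (needs the optional dependency)."""
    import spacy  # imported lazily so remove_entity_spans works without it
    doc = spacy.load(model)(text)
    return [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
```

For example, `remove_entity_spans("Alice Smith moved to Paris in 2019", [(0, 11, "PERSON"), (21, 26, "GPE"), (30, 34, "DATE")], entity_types=["PERSON", "GPE"])` returns `"moved to in 2019"`. Any regex patterns would then be applied to that result, since they run after NER removal.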
Output visibility
Topic modeling reports include an entity_removal report entry showing:
provider and model used
entity types removed
input/output document counts
regex patterns applied
When enabled, Biblicus writes entity_removal.jsonl alongside the topic modeling output.json file. This artifact contains the redacted documents used for clustering and is reused when rerunning the same snapshot.
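To inspect the artifact, a generic JSON Lines reader is enough. The sketch below makes no assumption about the fields in each record, since the schema is not described here; the helper name `load_jsonl` and the example path are hypothetical.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON value per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

Calling `load_jsonl("entity_removal.jsonl")` from the output directory and printing the first record is a quick way to see which fields the file actually contains.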