Entity removal

Entity removal is a local, non-LLM preprocessing step that deletes named entities from text, or replaces them with a placeholder string, before downstream analysis. It reduces topic fragmentation caused by proper nouns and identifiers such as names, addresses, and policy numbers.

This utility is optional and deterministic. It relies on a local spaCy model and does not send text to external services.

When to use it

Use entity removal when you want topic modeling or clustering to focus on intent rather than unique values. Common examples include:

  • Removing addresses, policy numbers, and names in customer-service transcripts.

  • Collapsing location-specific variants into a single intent bucket.

  • Cleaning entity-heavy corpora where proper nouns dominate keywords.

Installation

Entity removal uses spaCy. Install the optional dependency and the model:

pip install "biblicus[ner]"
python -m spacy download en_core_web_sm
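
To confirm that the model download succeeded before enabling entity removal, a quick check along the following lines may help. This is a minimal sketch, not part of Biblicus: the helper name model_available is hypothetical, and it relies only on the fact that spaCy's downloadable models are installed as regular Python packages.

```python
import importlib.util


def model_available(model_name: str = "en_core_web_sm") -> bool:
    """Return True if the named spaCy model package can be found.

    Hypothetical helper for illustration; Biblicus may perform its own check.
    """
    # Downloaded spaCy models are ordinary installed packages, so a
    # package lookup is enough to detect a missing download.
    return importlib.util.find_spec(model_name) is not None
```

If model_available() returns False, rerun the download command above in the same environment that runs Biblicus.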

Configuration

Entity removal is configured in the topic modeling section. It runs after LLM extraction (if enabled) and before lexical processing.

entity_removal:
  enabled: true
  provider: spacy
  model: en_core_web_sm
  entity_types:
    - PERSON
    - GPE
    - LOC
    - ORG
    - FAC
    - DATE
    - TIME
    - MONEY
    - PERCENT
    - CARDINAL
    - ORDINAL
  replace_with: ""
  collapse_whitespace: true
  regex_patterns:
    - "\\b\\d{5}(-\\d{4})?\\b"  # zip codes
    - "\\b\\d{3}[-. ]?\\d{3}[-. ]?\\d{4}\\b"  # phone numbers
  regex_replace_with: ""
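
The patterns are written with doubled backslashes because they sit inside double-quoted YAML strings, where "\\b" unescapes to the regex token \b (a single "\b" would be read as a backspace character). The sketch below shows what the two example patterns match once unescaped. It uses Python's standard re module; the function name apply_regex_patterns is hypothetical, and applying whitespace collapsing at this stage is an assumption rather than documented Biblicus behavior.

```python
import re

# The regexes the YAML strings above unescape to.
ZIP_CODE = r"\b\d{5}(-\d{4})?\b"
PHONE_NUMBER = r"\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b"


def apply_regex_patterns(text, patterns, regex_replace_with="", collapse_whitespace=True):
    """Apply each pattern in order, then optionally tidy the whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, regex_replace_with, text)
    if collapse_whitespace:
        # Assumption: runs of whitespace left by deletions become one space.
        text = re.sub(r"\s+", " ", text).strip()
    return text


sample = "Call 555-123-4567 or write to ZIP 90210-1234."
cleaned = apply_regex_patterns(sample, [ZIP_CODE, PHONE_NUMBER])
print(cleaned)
```

Note that the zip code pattern matches any standalone five-digit number, so it may also strip order numbers or similar values of that shape.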

Notes

  • When entity_types is empty, Biblicus uses a default set of common entity labels.

  • Regex patterns run after NER removal.

  • If replace_with is empty, entities are deleted; otherwise each removed entity is replaced with the given string.
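
The NER step can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the Biblicus implementation: remove_entities and spacy_entity_spans are hypothetical names, and the handling of overlapping spans is a choice made for the example. The spaCy calls used (spacy.load, doc.ents, ent.start_char, ent.end_char, ent.label_) are part of spaCy's public API.

```python
import re
from typing import Iterable, List, Set, Tuple

# (start_char, end_char, label): the fields spaCy exposes on each entity span.
EntitySpan = Tuple[int, int, str]


def remove_entities(
    text: str,
    spans: Iterable[EntitySpan],
    entity_types: Set[str],
    replace_with: str = "",
    collapse_whitespace: bool = True,
) -> str:
    """Cut out (or replace) every span whose label is in entity_types."""
    selected = sorted(span for span in spans if span[2] in entity_types)
    pieces: List[str] = []
    cursor = 0
    for start, end, _label in selected:
        if start < cursor:
            continue  # assumption: skip spans overlapping one already removed
        pieces.append(text[cursor:start])
        pieces.append(replace_with)
        cursor = end
    pieces.append(text[cursor:])
    result = "".join(pieces)
    if collapse_whitespace:
        result = re.sub(r"\s+", " ", result).strip()
    return result


def spacy_entity_spans(text: str, model: str = "en_core_web_sm") -> List[EntitySpan]:
    """Collect entity spans from a local spaCy model."""
    import spacy  # imported lazily so remove_entities works without spaCy

    nlp = spacy.load(model)
    return [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]
```

In this sketch, remove_entities(text, spacy_entity_spans(text), {"PERSON", "GPE"}) would strip people and places from a single document. The regex patterns would then be applied to the returned string, matching the ordering described in the notes above.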

Output visibility

Topic modeling reports include an entity_removal entry that shows:

  • provider and model used

  • entity types removed

  • input/output document counts

  • regex patterns applied

When enabled, Biblicus writes entity_removal.jsonl alongside the topic modeling output.json file. This artifact contains the redacted documents used for clustering and is reused when rerunning the same snapshot.
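
To spot-check what the redaction produced, the artifact can be read line by line. The sketch below assumes only that the file follows the usual JSON Lines convention of one JSON value per line. The record schema is not described here, so the code does not rely on any field names, and the helper name iter_jsonl is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: Path) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

For example, sum(1 for _ in iter_jsonl(Path("entity_removal.jsonl"))) counts the records in the file, which can be compared against the output document count in the report. The path shown is a placeholder; the file sits next to output.json in the topic modeling output.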