# Entity removal

Entity removal is a local, non-LLM preprocessing step that deletes named entities, and optionally regex-matched
identifiers, from text before downstream analysis. It is designed to reduce topic fragmentation caused by proper
nouns and identifiers (for example, names, addresses, and policy numbers).

This utility is optional and deterministic. It relies on a local spaCy model and does not send text to external
services.

## When to use it

Use entity removal when you want topic modeling or clustering to focus on intent rather than unique values.
Common examples include:

- Removing addresses, policy numbers, and names in customer-service transcripts.
- Collapsing location-specific variants into a single intent bucket.
- Cleaning entity-heavy corpora where proper nouns dominate keywords.

## Installation

Entity removal uses spaCy. Install the optional dependency and the model:

```bash
pip install "biblicus[ner]"
python -m spacy download en_core_web_sm
```
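
To confirm that both pieces are in place before enabling the step, a quick check along these lines may help. This is a minimal sketch and not part of Biblicus: `is_importable` is a hypothetical helper, and it relies on the fact that models fetched with `spacy download` are installed as regular Python packages.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Report on spaCy itself and on the model package named in this guide.
    for name in ("spacy", "en_core_web_sm"):
        status = "found" if is_importable(name) else "missing"
        print(f"{name}: {status}")
```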

## Configuration

Entity removal is configured under topic modeling and runs after LLM extraction (if enabled) and before lexical
processing.

```yaml
entity_removal:
  enabled: true
  provider: spacy
  model: en_core_web_sm
  entity_types:
    - PERSON
    - GPE
    - LOC
    - ORG
    - FAC
    - DATE
    - TIME
    - MONEY
    - PERCENT
    - CARDINAL
    - ORDINAL
  replace_with: ""
  collapse_whitespace: true
  regex_patterns:
    - "\\b\\d{5}(-\\d{4})?\\b"  # zip codes
    - "\\b\\d{3}[-. ]?\\d{3}[-. ]?\\d{4}\\b"  # phone numbers
  regex_replace_with: ""
```
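
The two `regex_patterns` above can be tried in isolation with Python's `re` module. The sketch below is illustrative only: the sample sentence and the `apply_patterns` helper are made up for this example, and Biblicus's internal implementation may differ. The doubled backslashes in the YAML double-quoted strings become single backslashes once parsed, which is what the raw strings here contain.

```python
import re

# Same patterns as in the YAML example, after YAML unescaping.
PATTERNS = [
    r"\b\d{5}(-\d{4})?\b",               # zip codes
    r"\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b",  # phone numbers
]


def apply_patterns(text: str, patterns: list[str], replacement: str = "") -> str:
    """Replace every match of each pattern, in order (hypothetical helper)."""
    for pattern in patterns:
        text = re.sub(pattern, replacement, text)
    return text


sample = "Call 555-123-4567 about the claim in 90210."
cleaned = apply_patterns(sample, PATTERNS)
print(repr(cleaned))  # 'Call  about the claim in .'
```

The doubled space left behind is the kind of residue that `collapse_whitespace` is presumably meant to clean up.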

### Notes

- When `entity_types` is empty, Biblicus uses a default set of common entity labels.
- Regex patterns run **after** NER removal.
- If `replace_with` is an empty string, matched entities are deleted outright; otherwise each one is replaced with
  the given string.
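
The ordering described in these notes can be sketched in plain Python. This is not Biblicus's implementation: `remove_entities` is a hypothetical function, and the entity spans are passed in as `(start, end, label)` character offsets, which is the kind of information spaCy exposes on each entity through `start_char`, `end_char`, and `label_`. The sketch assumes spans do not overlap and that whitespace collapsing happens last.

```python
import re
from typing import Iterable, Sequence

Span = tuple[int, int, str]  # (start_char, end_char, label)


def remove_entities(
    text: str,
    spans: Iterable[Span],
    entity_types: Sequence[str],
    replace_with: str = "",
    regex_patterns: Sequence[str] = (),
    regex_replace_with: str = "",
    collapse_whitespace: bool = True,
) -> str:
    # 1. NER removal: drop spans whose label is selected. Work right to left
    #    so earlier offsets stay valid as the string changes length.
    selected = [span for span in spans if span[2] in entity_types]
    for start, end, _label in sorted(selected, reverse=True):
        text = text[:start] + replace_with + text[end:]
    # 2. Regex patterns run after NER removal.
    for pattern in regex_patterns:
        text = re.sub(pattern, regex_replace_with, text)
    # 3. Assumed final step: squeeze runs of whitespace left by deletions.
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text


text = "Maria Lopez from Denver called about zip 80202."
spans = [(0, 11, "PERSON"), (17, 23, "GPE")]
result = remove_entities(
    text, spans, ["PERSON", "GPE"], regex_patterns=[r"\b\d{5}(-\d{4})?\b"]
)
print(result)  # from called about zip .
```

Passing a non-empty `replace_with` (for example `"[X]"`) keeps a placeholder where each entity used to be instead of deleting it.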

## Output visibility

Topic modeling reports include an `entity_removal` report entry showing:

- provider and model used
- entity types removed
- input/output document counts
- regex patterns applied

When enabled, Biblicus writes `entity_removal.jsonl` alongside the topic modeling `output.json` file. This artifact
contains the redacted documents used for clustering and is reused when rerunning the same snapshot.
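
For a quick look at the artifact, something like the following may be enough. The record schema is not described here, so this sketch only assumes the usual JSON Lines convention of one JSON value per line and does not rely on any field names. `iter_jsonl` is a hypothetical helper and the path in the comment is a placeholder.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: Path) -> Iterator[Any]:
    """Yield one parsed JSON value per non-blank line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Example usage (placeholder path, adjust to your run directory):
# records = list(iter_jsonl(Path("path/to/entity_removal.jsonl")))
# print(len(records))
```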