[Pipeline diagram: an input Entity (entity_name; optional country, industry, and lei) flows through three stages: (1) Candidate Entity Retrieval, which searches the CB Entity Database for likely candidates; (2) Feature Engineering, which measures name similarity and metadata alignment per candidate; and (3) ML Scoring, which scores each candidate as a match probability. The output is a Top Result with CBId, CBEntityName, confidence, and rank.]
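The request and response shapes in the diagram can be sketched as plain data structures. This is an illustrative sketch: field names follow the diagram, but the types and defaults are assumptions, not the published schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EntityInput:
    """Entity submitted for matching; only entity_name is required."""
    entity_name: str
    country: Optional[str] = None   # optional, e.g. a country code
    industry: Optional[str] = None  # optional sector hint
    lei: Optional[str] = None       # optional Legal Entity Identifier

@dataclass
class MatchResult:
    """One candidate returned by the pipeline."""
    CBId: str
    CBEntityName: str
    confidence: float  # model's match probability
    rank: int          # 1 = top-scoring candidate
```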
Pipeline
Candidate Entity Retrieval
The CB Entity Database supports approximate text search, returning a shortlist of plausible candidates. Retrieval uses BM25 ranking — scoring candidates by term frequency and inverse document frequency — and normalises input text to handle punctuation, accents, legal suffixes, and common name variants. Around 20 candidates are retrieved per name. This stage prioritises recall over precision: the true match must appear in the candidate set before scoring can begin.

Feature Engineering
For each candidate, a feature vector is built from dozens of individual signals, grouped into four categories:

| Category | Examples |
|---|---|
| String similarity | Jaccard token overlap, Levenshtein distance, n-gram similarity |
| Search relevance | BM25 score and rank position from the retrieval stage |
| Text normalisation | Comparison after stripping punctuation, accents, legal suffixes, and name variants |
| Metadata alignment | Country, sector, and identifier (LEI) consistency between input and candidate |
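A few of the string-similarity signals in the table above can be sketched with stdlib-only code. This is illustrative: the production feature set, the suffix list, and the exact normalisation rules are not published here.

```python
import re
import unicodedata

# Illustrative subset; the production suffix list is not published.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "llc", "plc", "gmbh", "sa", "ag"}

def normalise(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop common legal suffixes."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    name = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [t for t in name.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap on normalised names."""
    ta, tb = set(normalise(a).split()), set(normalise(b).split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

Normalising before comparing is what lets "Acme Ltd" and "ACME Limited" score as identical rather than merely similar.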
Machine Learning (ML) Scoring
A machine learning classifier assigns a match probability to each candidate independently. Candidates are ranked by this probability and the top results are returned in the response.

Training
The model has been trained on an internal dataset of tens of thousands of labelled entity matches — each a true or false match pair. This is distinct from the CB Entity Database itself, which contains millions of records corresponding to observed entities from bank submissions. The model is retrained weekly as both the CB Entity Database and the matching universe grow.

Testing
Performance is evaluated using k-fold cross-validation, ensuring metrics reflect generalisation across the full labelled dataset rather than a single train/test split. As the model is retrained on new data, performance is re-evaluated each cycle. Classification metrics are reported on the Accuracy & Coverage page.

Confidence Score
Each candidate is returned with a score reflecting the model’s certainty that it is the correct match. Internally, we use the following bands as guidance, based on performance measured on our testing data:

| Range | Signal | Match Rate | Rationale |
|---|---|---|---|
| | Strong match | 94.4% | High enough confidence to treat as a match without manual review |
| | Likely match | ~65% | The model considers a match plausible but not certain — scores in this range warrant review before accepting |
| | Weak | ~35% | The candidate is less likely to be the correct match — typically surfaced only to confirm no match exists |
When multiple candidates are requested (limit > 1), the true match may still be present somewhere in the result set even at lower scores — reviewing the top candidates collectively improves the chance of a correct resolution when no single score is high.
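As a sketch of how a consumer might act on the bands above: auto-accept a strong top match, route likely matches to review, and review the shortlist collectively otherwise. The threshold values below are placeholders for illustration; the production score ranges behind the bands are not published in this document.

```python
from typing import List, Tuple

# Placeholder thresholds for illustration only; not the production ranges.
STRONG_T = 0.90
LIKELY_T = 0.50

def triage(candidates: List[Tuple[str, float]]) -> str:
    """Decide how to handle a ranked list of (CBId, confidence) pairs.

    Auto-accept a strong top match; flag a likely match for review;
    otherwise review the whole shortlist, since the true match may still
    be present lower down even when no single score is high.
    """
    if not candidates:
        return "no_match"
    top_id, top_score = max(candidates, key=lambda c: c[1])
    if top_score >= STRONG_T:
        return f"accept:{top_id}"
    if top_score >= LIKELY_T:
        return f"review:{top_id}"
    return "review_shortlist"
```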
See Accuracy & Coverage for full threshold trade-off analysis.
