The matching pipeline resolves free-text entity names to Credit Benchmark identifiers across three steps: candidate retrieval, feature engineering, and ML scoring. For each input entity, the pipeline returns ranked matches, each with a confidence score between 0 and 1.
Input — Entity:
- entity_name
- country (optional)
- industry (optional)
- lei (optional)

1 — Candidate Entity Retrieval: searches the CB Entity Database for likely candidates.
2 — Feature Engineering: measures name similarity and metadata alignment per candidate.
3 — ML Scoring: scores each candidate as a match probability.

Output — Top Result:
- CBId
- CBEntityName
- confidence
- rank

This flow shows the end-to-end matching workflow from input entity fields to ranked candidate results.

Pipeline

Candidate Entity Retrieval

The CB Entity Database supports approximate text search, returning a shortlist of plausible candidates. Retrieval uses BM25 ranking — scoring candidates by term frequency and inverse document frequency — and normalises input text to handle punctuation, accents, legal suffixes, and common name variants. Around 20 candidates are retrieved per name. This stage prioritises recall over precision: the true match must appear in the candidate set before scoring can begin.
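The normalisation step described above can be sketched as follows. This is a minimal illustration only: the suffix list, function name, and exact rules are assumptions, not the service's actual implementation.

```python
import re
import unicodedata

# Hypothetical suffix list for illustration; the production list is internal.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "plc", "llc", "gmbh", "ag"}

def normalise_name(name: str) -> str:
    """Lower-case a name, strip accents and punctuation, and drop legal suffixes."""
    # Decompose accented characters, then drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace punctuation with spaces, then tokenise.
    cleaned = re.sub(r"[^\w\s]", " ", ascii_text.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)
```

Normalising both the input name and each candidate before comparison means that, for example, "Acme Holdings Ltd." and "ACME HOLDINGS" reduce to the same string.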

Feature Engineering

For each candidate, a feature vector $\mathbf{x}$ is built from dozens of individual signals, grouped into four categories:
| Category | Examples |
| --- | --- |
| String similarity | Jaccard token overlap, Levenshtein distance, n-gram similarity |
| Search relevance | BM25 score and rank position from the retrieval stage |
| Text normalisation | Comparison after stripping punctuation, accents, legal suffixes, and name variants |
| Metadata alignment | Country, sector, and identifier (LEI) consistency between input and candidate |
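Two of the string-similarity signals, Jaccard token overlap and Levenshtein distance, can be sketched and assembled into a toy feature vector. The feature choice and names here are illustrative assumptions; the production vector contains many more signals.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two names."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def features(input_name, candidate_name, bm25_score, same_country):
    """Assemble a small illustrative feature vector for one candidate."""
    return [
        jaccard(input_name, candidate_name),
        levenshtein(input_name.lower(), candidate_name.lower()),
        bm25_score,
        1.0 if same_country else 0.0,
    ]
```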

ML Scoring

A machine learning classifier assigns a match probability to each candidate independently: $\hat{p} = P(\text{match} \mid \mathbf{x})$. Candidates are ranked by $\hat{p}$ and the top results are returned in the response.
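The score-and-rank step can be sketched with a simple logistic model standing in for the production classifier. The weights, bias, and function names are illustrative assumptions; any classifier that outputs calibrated probabilities would fill the same role.

```python
import math

def match_probability(x, weights, bias):
    """Logistic match probability: p = sigma(w . x + b)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def rank_candidates(candidates, weights, bias):
    """Score each (id, feature_vector) pair independently, sort by probability."""
    scored = [(cid, match_probability(x, weights, bias)) for cid, x in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```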

Training

The model has been trained on an internal dataset of tens of thousands of labelled entity matches — each a true or false match pair. This is distinct from the CB Entity Database itself, which contains millions of records corresponding to observed entities from bank submissions. The model is retrained weekly as both the CB Entity Database and the matching universe grow.

Testing

Performance is evaluated using k-fold cross-validation, ensuring metrics reflect generalisation across the full labelled dataset rather than a single train/test split. As the model is retrained on new data, performance is re-evaluated each cycle. Classification metrics are reported on the Accuracy & Coverage page.
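The k-fold split itself is a generic technique and can be sketched as below; the evaluation pipeline around it is internal, and this function name is an assumption.

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every example appears in exactly one test fold, so metrics are
    aggregated over the full labelled dataset rather than one split.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```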

Confidence Score

Each candidate is returned with a score $\hat{p} \in [0, 1]$ reflecting the model's certainty that it is the correct match. Internally, we use the following bands as guidance, based on performance measured on our testing data:
| Range | Signal | Match Rate | Rationale |
| --- | --- | --- | --- |
| $\hat{p} \geq 0.60$ | Strong match | 94.4% | High enough confidence to treat as a match without manual review |
| $0.30 \leq \hat{p} < 0.60$ | Likely match | ~65% | The model considers a match plausible but not certain; scores in this range warrant review before accepting |
| $\hat{p} < 0.30$ | Weak | ~35% | The candidate is less likely to be the correct match; typically surfaced only to confirm no match exists |
These figures reflect per-candidate match rates. When multiple candidates are returned (limit > 1) at lower scores, the true match may still be present somewhere in the result set — reviewing the top candidates collectively improves the chance of a correct resolution even when no single score is high. See Accuracy & Coverage for full threshold trade-off analysis.
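The guidance bands above can be applied with a small helper. The thresholds and labels come from the table; the function name is an assumption for illustration.

```python
def confidence_band(p: float) -> str:
    """Map a match probability to the documented guidance bands."""
    if p >= 0.60:
        return "strong"
    if p >= 0.30:
        return "likely"
    return "weak"
```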