[Pipeline diagram: an input Entity (entity_name; optional country, industry, and lei) flows through three stages: (1) Candidate Entity Retrieval, which searches the CB Entity Database for likely candidates; (2) Feature Engineering, which measures name similarity and metadata alignment per candidate; and (3) ML Scoring, which scores each candidate as a match probability. The output is a Top Result with CBId, CBEntityName, confidence, and rank.]
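The request and response shapes in the diagram can be sketched as plain data structures. This is an illustrative sketch: field names follow the diagram, but the types and defaults are assumptions, not the published schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EntityInput:
    """Entity submitted for matching; only entity_name is required."""
    entity_name: str
    country: Optional[str] = None   # optional, e.g. a country code
    industry: Optional[str] = None  # optional sector hint
    lei: Optional[str] = None       # optional Legal Entity Identifier

@dataclass
class MatchResult:
    """One candidate returned by the pipeline."""
    CBId: str
    CBEntityName: str
    confidence: float  # model's match probability
    rank: int          # 1 = top-scoring candidate
```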
Pipeline
Candidate Entity Retrieval
The CB Entity Database supports approximate text search, returning a shortlist of plausible candidates. Retrieval uses BM25 ranking — scoring candidates by term frequency and inverse document frequency — and normalises input text to handle punctuation, accents, legal suffixes, and common name variants. Around 20 candidates are retrieved per name. This stage prioritises recall over precision: the true match must appear in the candidate set before scoring can begin.

Feature Engineering
For each candidate, a feature vector is built from dozens of individual signals, grouped into four categories:

| Category | Examples |
|---|---|
| String similarity | Jaccard token overlap, Levenshtein distance, n-gram similarity |
| Search relevance | BM25 score and rank position from the retrieval stage |
| Text normalisation | Comparison after stripping punctuation, accents, legal suffixes, and name variants |
| Metadata alignment | Country, sector, and identifier (LEI) consistency between input and candidate |
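A few of the string-similarity signals in the table above can be sketched with stdlib-only code. This is illustrative: the production feature set, the suffix list, and the exact normalisation rules are not published here.

```python
import re
import unicodedata

# Illustrative subset; the production suffix list is not published.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "llc", "plc", "gmbh", "sa", "ag"}

def normalise(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop common legal suffixes."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    name = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [t for t in name.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap on normalised names."""
    ta, tb = set(normalise(a).split()), set(normalise(b).split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

Normalising before comparing is what lets "Acme Ltd" and "ACME Limited" score as identical rather than merely similar.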
Machine Learning (ML) Scoring
A machine learning classifier assigns a match probability to each candidate independently. Candidates are ranked by this probability and the top results are returned in the response.

Training
The model has been trained on an internal dataset of tens of thousands of labelled entity matches — each a true or false match pair. This is distinct from the CB Entity Database itself, which contains millions of records corresponding to observed entities from bank submissions. The model is retrained weekly as both the CB Entity Database and the matching universe grow.

Testing
Performance is evaluated using k-fold cross-validation, ensuring metrics reflect generalisation across the full labelled dataset rather than a single train/test split. As the model is retrained on new data, performance is re-evaluated each cycle. Classification metrics are reported on the Accuracy & Coverage page.

Confidence Score
Each candidate is returned with a score reflecting the model’s certainty that it is the correct match. Internally, we use the following bands as guidance, based on performance measured on our testing data:

| Range | Signal | Match Rate | Rationale |
|---|---|---|---|
| | Strong match | 94.4% | High enough confidence to treat as a match without manual review |
| | Likely match | ~65% | The model considers a match plausible but not certain — scores in this range warrant review before accepting |
| | Weak | ~35% | The candidate is less likely to be the correct match — typically surfaced only to confirm no match exists |
When multiple candidates are requested (limit > 1), the true match may still be present somewhere in the result set even at lower scores — reviewing the top candidates collectively improves the chance of a correct resolution when no single score is high.
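As a sketch of how a consumer might act on the bands above: auto-accept a strong top match, route likely matches to review, and review the shortlist collectively otherwise. The threshold values below are placeholders for illustration; the production score ranges behind the bands are not published in this document.

```python
from typing import List, Tuple

# Placeholder thresholds for illustration only; not the production ranges.
STRONG_T = 0.90
LIKELY_T = 0.50

def triage(candidates: List[Tuple[str, float]]) -> str:
    """Decide how to handle a ranked list of (CBId, confidence) pairs.

    Auto-accept a strong top match; flag a likely match for review;
    otherwise review the whole shortlist, since the true match may still
    be present lower down even when no single score is high.
    """
    if not candidates:
        return "no_match"
    top_id, top_score = max(candidates, key=lambda c: c[1])
    if top_score >= STRONG_T:
        return f"accept:{top_id}"
    if top_score >= LIKELY_T:
        return f"review:{top_id}"
    return "review_shortlist"
```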
See Accuracy & Coverage for full threshold trade-off analysis.
