How to build a credit-scoring model with big data

In a Nutshell

Credit-scoring agencies and creditors are always working to improve their scoring models. Greater access to consumer data and developments in computing power may be changing that process.

Editorial Note: Credit Karma receives compensation from third-party advertisers, but that doesn’t affect our editors' opinions. Our marketing partners don’t review, approve or endorse our editorial content. It’s accurate to the best of our knowledge when it’s posted.

Companies continually test, build and update their credit-scoring models. Here’s a peek at that process and how technological advances could change it.

You may occasionally see headlines when credit-scoring companies like FICO or VantageScore release a new credit-scoring model. There might be a discussion in the media about how those new models could affect consumers’ credit scores and ability to get approved for loans and credit cards. That’s only part of the picture, though.

Unknown to many consumers, large financial services companies are continually creating and updating custom scoring models. They may use these models instead of, or in combination with, scores created by credit-score-industry heavyweights FICO or VantageScore.

Today, companies can gather and analyze vast amounts of data and feed it into custom scoring models to help determine your score.


Why create custom scoring models?

“You can buy a [generic] score, and it works well,” says Naeem Siddiqi, director of credit scoring at SAS, a data analytics and management company, and author of several books on the topic, referring to scores created in the credit industry. “But the bigger banks want to do better than a generic bureau score, so they build internal models.”

A generic score is generally based solely on the information in credit reports from the three major consumer credit bureaus, while custom models can incorporate a wide range of data, such as information from an application for a financial product or account and internal data on current or past customers.

The inclusion of these alternative data points (meaning data not typically found in your credit reports) is designed to help companies better understand the risks and opportunities associated with their particular customers and prospects.

4 steps to create and implement a new scoring model

There are different ways to develop a new credit-scoring or risk model, but here’s an overview of what it may look like.

Step 1: Defining a goal

The first step is deciding on a goal, or what the scoring model is meant to predict. With generic credit-scoring models, the goal is usually to predict the likelihood that someone will be 90 days late on a payment within the next two years.
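To make the idea of a goal concrete, here is a minimal Python sketch of how a "90 days late within two years" outcome might be turned into a yes/no label for each account. The function name `is_bad`, the 730-day window and the sample dates are illustrative assumptions, not any bureau's or lender's actual definition.

```python
from datetime import date

def is_bad(observation_date, delinquency_events,
           days_late_threshold=90, window_days=730):
    """Return True if any delinquency of at least `days_late_threshold` days
    falls within `window_days` after the observation date.

    `delinquency_events` is a list of (event_date, days_late) pairs.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    for event_date, days_late in delinquency_events:
        days_since_observation = (event_date - observation_date).days
        in_window = 0 <= days_since_observation <= window_days
        if in_window and days_late >= days_late_threshold:
            return True
    return False

# Made-up payment history: one 30-day-late event and one 95-day-late event.
events = [(date(2021, 6, 1), 30), (date(2022, 3, 1), 95)]
print(is_bad(date(2021, 1, 1), events))
```

A model built for a different goal, such as accepting a card offer, would swap in a different labeling rule but follow the same pattern.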

Creditors may want to build custom scoring models that help predict the same thing more accurately by using internal company data. Or they may have other goals in mind.

“Most banks are running a lot of models at the prospecting, new-account-underwriting, and portfolio-management stage … (while) there might be another set for handling delinquent accounts,” says Duane Good, managing partner at RiskThought, a risk-management advisory firm.

Custom models can be built to help predict things like the likelihood that a consumer will accept a credit card offer, become a profitable customer, keep current on a bill or declare bankruptcy.

Step 2: Gathering data and building the model

With a goal decided, the next step is for companies to find data to build and test the model.

For start-ups with little or no data of their own, the answer is to build a model using anonymized data, says Paul Greenwood, president and co-founder of GDS Link, which creates credit-risk-management software. Companies that are already established may have customer data they can use for this purpose.

Using the data, companies can examine consumer behavior like account openings or bill-payment patterns — or behavior related to a different outcome they want to predict. That allows the company to build models that find connections between the anonymous consumers’ profiles and that outcome.

Then the variables need to be ranked according to how relevant they are to the predicted outcome, says Florian Lyonnet, chief data scientist at GDS Link. For example, it may turn out that a consumer’s history of on-time payments is a powerful predictor of whether they’ll make on-time payments in the future.

The list of variables can then be whittled down to a smaller set of maybe 10 to 20 of the most predictive variables. These final variables can be analyzed to help predict the probability of the company’s goal occurring.
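One common way to rank candidate variables is a measure called information value, which compares how a variable's categories are spread among accounts that performed well versus those that did not. The sketch below is an illustration under stated assumptions, not the method any company quoted here necessarily uses; the variable names, the smoothing constant and the tiny data set are all made up.

```python
import math
from collections import Counter

def information_value(values, labels, smoothing=0.5):
    """Information value of a categorical (or pre-binned) variable.

    labels: 1 = bad outcome, 0 = good outcome.
    `smoothing` (an assumption of this sketch) avoids dividing by zero
    when a category has no goods or no bads.
    """
    good = Counter(v for v, y in zip(values, labels) if y == 0)
    bad = Counter(v for v, y in zip(values, labels) if y == 1)
    categories = set(good) | set(bad)
    total_good = sum(good.values()) + smoothing * len(categories)
    total_bad = sum(bad.values()) + smoothing * len(categories)
    iv = 0.0
    for category in categories:
        pct_good = (good[category] + smoothing) / total_good
        pct_bad = (bad[category] + smoothing) / total_bad
        iv += (pct_good - pct_bad) * math.log(pct_good / pct_bad)
    return iv

# Made-up outcomes for eight anonymous consumers (1 = fell behind).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
candidates = {
    # Hypothetical variable: count of past late payments, pre-binned.
    "payment_history": ["none", "none", "none", "some",
                        "some", "many", "many", "none"],
    # Hypothetical variable with the same mix among goods and bads.
    "application_channel": ["online", "online", "branch", "branch",
                            "online", "online", "branch", "branch"],
}

# Most predictive variables first; a real project would keep the top 10 to 20.
ranked = sorted(candidates,
                key=lambda name: information_value(candidates[name], labels),
                reverse=True)
print(ranked)
```

In this toy data the payment-history variable ranks first because its categories split very differently between the two groups, while the channel variable carries no signal at all.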

Score developers have to be careful about which variables wind up in the final scoring model, though. That’s because scoring models must comply with a range of regulations. For example, a creditor can’t use legally protected characteristics, such as race, as variables in a credit-scoring model.

Step 3: Validating the model

New models can then be evaluated — or validated — to ensure that they have consistent and accurate outcomes and comply with regulations.

Consider a model that’s designed to predict the likelihood that someone will be 90-plus days late on a payment and ranks consumers on a scale of 1 to 100, with a higher number indicating the consumer is less likely to be late. On this scale, someone with a score of 90 should always be a lower risk than someone with a score of 60.

The chance that someone with a given score may fall behind can change with shifts in consumer behavior and the economy. For example, people may be more likely to pay a bill late during a recession. But the rank-order of consumers from lower to higher risk should still be about the same as before the downturn.
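The rank-order idea can be checked with a few lines of Python. This is a minimal sketch with made-up numbers: the score bands, the function names and both data sets are assumptions for illustration. It groups consumers into score bands, computes the share who fell behind in each band, and confirms that the share never rises as scores go up.

```python
# Hypothetical score bands on the 1-to-100 scale described above.
BANDS = [(1, 25), (26, 50), (51, 75), (76, 100)]

def bad_rates(records, bands):
    """records: list of (score, is_bad) pairs; returns the bad rate per band."""
    rates = []
    for low, high in bands:
        outcomes = [bad for score, bad in records if low <= score <= high]
        rates.append(sum(outcomes) / len(outcomes) if outcomes else None)
    return rates

def rank_order_holds(rates):
    """True if the bad rate never increases from one band to the next higher one."""
    observed = [r for r in rates if r is not None]
    return all(lower >= higher for lower, higher in zip(observed, observed[1:]))

# Made-up outcomes in a normal economy (1 = 90-plus days late).
normal = [(10, 1), (15, 1), (20, 0), (25, 0), (30, 1), (35, 0), (40, 0), (50, 0),
          (55, 1), (60, 0), (70, 0), (75, 0), (80, 0), (85, 0), (90, 0), (100, 0)]
# Made-up outcomes in a recession: more late payers overall.
recession = [(10, 1), (15, 1), (20, 1), (25, 0), (30, 1), (35, 1), (40, 0), (50, 0),
             (55, 1), (60, 0), (70, 0), (75, 0), (80, 1), (85, 0), (90, 0), (100, 0)]

for name, records in [("normal", normal), ("recession", recession)]:
    rates = bad_rates(records, BANDS)
    print(name, rates, rank_order_holds(rates))
```

In the toy recession data every band's late-payment rate is at least as high as before, yet the ordering from riskiest to safest band is unchanged, which is the property a validation check like this looks for.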

Companies may periodically revalidate their models to see if they need adjusting or to see if the model’s outputs call for any changes in business strategy.

Step 4: Testing and implementing a new model

Businesses may continually evolve and test their models, trying to figure out which ones work best in a given situation.

“You’ll have your production model that you’re using today,” says Good. “At the same time, [companies] are monitoring and back-testing to see how it performs against the vintage models that were previously used.”

And there are new models, sometimes called “challenger” models, that are tested against the current production, or “champion,” models.

Once a new model meets regulatory requirements and is validated, it can go into production alongside the current champion models.

“Sometimes it’s very subtle, and you’re just checking slight variations,” says Good. “You think you have some better data, maybe new bureau data or alternative data.”

Companies will run the challenger models with real customers or applicants, comparing performance against the champion models. If a new model outperforms the current model, it may become a new champion model.
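A simplified way to picture the comparison: score the same group of consumers with both models, then measure how often each model gives a consumer who stayed current a higher score than one who fell behind. The Python sketch below does this with a pairwise measure often called AUC. The scores and outcomes are invented, and real champion-challenger tests run on live applicants and track many more metrics than this.

```python
def auc(scores, labels):
    """Share of (good, bad) pairs in which the good account has the higher score.

    labels: 1 = bad outcome, 0 = good outcome. Ties count as half a win.
    Assumes a higher score means lower risk, and at least one good and one bad.
    """
    goods = [s for s, y in zip(scores, labels) if y == 0]
    bads = [s for s, y in zip(scores, labels) if y == 1]
    wins = sum(1.0 if g > b else 0.5 if g == b else 0.0
               for g in goods for b in bads)
    return wins / (len(goods) * len(bads))

# Made-up outcomes for seven consumers, scored by both (hypothetical) models.
labels = [0, 0, 0, 0, 1, 1, 1]
champion_scores = [80, 70, 55, 40, 60, 45, 30]
challenger_scores = [85, 75, 65, 50, 55, 40, 30]

champion_auc = auc(champion_scores, labels)
challenger_auc = auc(challenger_scores, labels)
print(f"champion={champion_auc:.3f} challenger={challenger_auc:.3f}")

# Promote the challenger only if it separates goods from bads better.
new_champion = "challenger" if challenger_auc > champion_auc else "champion"
print(new_champion)
```

A value of 0.5 means the model separates the two groups no better than chance, and 1.0 means it ranks every good account above every bad one.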

Using big data and machine learning in model development

Companies have access to vast amounts of information about their customers and prospects. A credit card issuer, for example, may have its internal customer records and access to credit reports, but can also boost that trove by buying other data.

There may be valuable insights buried in the data, but companies may face challenges sorting through and understanding everything. Some companies are using machine learning to make sense of it all and develop new scoring models.

With machine learning, you can “train” a model to find patterns within large data sets. The resulting model may be able to score more people, or more accurately score people, by finding previously unrecognized correlations. Additionally, model developers may be able to use machine learning to quickly analyze large amounts of data, which can help with creating new models in less time.

But machine learning brings new challenges alongside the opportunities.

“While you can feed it data and it gives you an output, you might not know why,” says Siddiqi.

So-called “black box” models can fall short because they may not provide the information and explanations behind people’s scores. That can be a problem when you’re required to give regulators insight into your scoring model and how it makes determinations.
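For contrast, a traditional points-based scorecard makes explanations easy to produce, because each variable contributes a visible number of points. The Python sketch below is purely illustrative: the variables, point values and the `score_with_reasons` function are hypothetical, not any lender's actual scorecard. It returns a score along with the variables that cost the applicant the most points.

```python
# Hypothetical scorecard: points awarded for each value of each variable.
SCORECARD = {
    "payment_history": {"no_late": 40, "one_late": 25, "multiple_late": 5},
    "utilization": {"low": 30, "medium": 20, "high": 5},
    "account_age": {"10y_plus": 30, "2_to_10y": 20, "under_2y": 10},
}

def score_with_reasons(applicant, scorecard=SCORECARD, top_n=2):
    """Return (total score, variables that lost the most points)."""
    total = 0
    shortfalls = []
    for variable, points_by_value in scorecard.items():
        points = points_by_value[applicant[variable]]
        total += points
        # Points lost compared with the best possible value for this variable.
        shortfall = max(points_by_value.values()) - points
        if shortfall > 0:
            shortfalls.append((shortfall, variable))
    shortfalls.sort(reverse=True)
    reasons = [variable for _, variable in shortfalls[:top_n]]
    return total, reasons

# Made-up applicant.
applicant = {"payment_history": "one_late",
             "utilization": "high",
             "account_age": "10y_plus"}
score, reasons = score_with_reasons(applicant)
print(score, reasons)
```

With a black-box model there is often no equally direct way to read off which inputs drove a particular score, which is the gap the experts quoted here are describing.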

“Typically, what we’ve seen in the industry, at least in my opinion, is that people are reluctant to use machine learning for originations,” says Lyonnet.

However, the techniques could be used for other purposes, like detecting fraud, or in less-regulated lending environments outside the U.S.


Bottom line

Credit-scoring agencies and creditors continually test and build new credit-scoring models. The availability of “big data” could create opportunities for creditors who want to prospect consumers, approve new accounts, manage customers and increase profits. But companies may also need to learn how to implement machine learning — possibly the most efficient way to analyze the data — in a way that meets certain regulatory requirements.