
CFPB Fraud Detection with AI

Detecting Duplicate and Forged Documents with Artificial Intelligence

CFPB, their Mission, and the Challenge

The Consumer Financial Protection Bureau (CFPB) is a U.S. government agency that ensures banks, lenders, and other financial companies treat citizens fairly. CFPB has been bringing stakeholders together in a series of Tech Sprints to discover how to use advanced technologies to solve challenging problems. Some of the sprints have focused on managing large amounts of financial information, processing consumer complaints more effectively, and replacing paper-based processes with electronic ones.  


This concept inspired us, and we decided to start our own sprint. We brought together a skilled Agile data science team to explore how Artificial Intelligence (AI) and Machine Learning (ML) could improve the efficiency of CFPB’s services.

AI/ML for Automated Processing

One of the main services of CFPB is to field consumer complaints. CFPB helps resolve disputes between citizens and private companies. In doing so, its analysts manually review submitted documents to ensure they are real and legitimate. This manual review is an ideal use case for AI/ML. At the same time, we planned to flag inaccurate submissions, helping protect consumers who use the process. We’ve used AI/ML to detect and combat fraud, waste, and abuse in the past, making this a perfect fit for us.

Identity theft is on the rise, and it is important to ensure the safety and financial security of the nation’s citizens. Consumers often submit documents to CFPB to establish proof of residency and identity. Utility bills are regularly used in this process, but they are susceptible to forgery. Forged utility bills can be used to wreak havoc with personal finances. Would it be possible to use AI/ML to flag fraudulent documents that might be used for identity theft? We wanted to find out! 

Our Vision 

We set out to create an extensible ML solution that could identify duplicate, and potentially forged, documents. If we could do that, we would provide two benefits:

  1. We would automate a manual review process and free time for analysts to work on more complex challenges.
  2. At the same time, we could reduce the risk of fraud, waste, and abuse.

We have developed similar AI/ML solutions for our federal and commercial clients. Our goal was to create a solution that would meet CFPB’s needs as well as those of other federal agencies and private sector businesses. Along the way, we could integrate the new solution with Excella’s Data Science Toolkit. Our toolkit is a set of tools, practices, and skill sets we apply to our data science and AI projects. Once we defined our goals and agreed to proceed, we started to build the solution.  

Our Solution

We considered utility bills the best starting point because people use them to substantiate their identity and home address. The team built a utility bill generator and automatically created a set of 2,500 example bills for our ML training data set. After we trained the model, we created many forged documents to test specific permutations of “fake” utility bills. We used these fakes to tune and calibrate the model, allowing us to consistently detect forgeries under a variety of conditions.
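To make the idea of a bill generator concrete, here is a minimal sketch of how synthetic bills and altered copies could be produced. It is not our actual generator: the field names, the sample values, and the `generate_bill`, `forge_bill`, and `render` helpers are all hypothetical, chosen only to illustrate the approach.

```python
import random

# Hypothetical sample values; the real generator's templates are not shown here.
PROVIDERS = ["Acme Electric", "Metro Water", "City Gas"]
NAMES = ["Jane Doe", "John Smith", "Maria Garcia", "Wei Chen"]
STREETS = ["Oak St", "Main St", "Pine Ave", "Elm Rd"]


def generate_bill(rng):
    """Return one synthetic utility bill as a dict of fields."""
    return {
        "provider": rng.choice(PROVIDERS),
        "name": rng.choice(NAMES),
        "address": f"{rng.randint(100, 9999)} {rng.choice(STREETS)}",
        "account": str(rng.randint(10**7, 10**8 - 1)),
        "amount_due": f"{rng.uniform(20, 300):.2f}",
    }


def forge_bill(bill, rng):
    """Copy a bill and swap the customer name, mimicking a simple forgery."""
    forged = dict(bill)
    forged["name"] = rng.choice([n for n in NAMES if n != bill["name"]])
    return forged


def render(bill):
    """Flatten a bill into the plain text a document parser might extract."""
    return (
        f"{bill['provider']} statement for {bill['name']}, {bill['address']}. "
        f"Account {bill['account']}. Amount due: ${bill['amount_due']}."
    )


rng = random.Random(42)  # fixed seed so the data set is reproducible
training_bills = [generate_bill(rng) for _ in range(2500)]
fake = forge_bill(training_bills[0], rng)
```

Other permutations, such as a changed address or amount, could be added to `forge_bill` in the same way.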

Figure: Graph of cosine similarity by frequency, showing how forged and duplicative documents have higher cosine similarity

Our solution can review numerous document formats, including ones that it has not encountered before. It takes in documents for analysis, extracts the relevant text, and then uses Natural Language Processing (NLP) to create a vector representation of each document. Once a document's vector is calculated, we compare it to the vectors of other documents using cosine similarity.
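The comparison step can be sketched in a few lines. This example assumes the simplest possible vectorization, a bag-of-words count of each token; it is an illustration only, and the `tokenize`, `vectorize`, and `cosine_similarity` helpers and the sample bill text are hypothetical rather than taken from our solution.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text):
    """Represent a document as a sparse vector of token counts."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 to 1.0)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical extracted text from three documents.
original = "Acme Electric bill for Jane Doe 12 Oak St amount due 84.10"
forged = "Acme Electric bill for John Smith 12 Oak St amount due 84.10"
unrelated = "Metro Water notice: service interruption scheduled for Pine Ave"

sim_forged = cosine_similarity(vectorize(original), vectorize(forged))
sim_unrelated = cosine_similarity(vectorize(original), vectorize(unrelated))
```

Because the forged bill changes only the name, most of its tokens still match the original, so `sim_forged` is much higher than `sim_unrelated`. That gap is what makes altered copies stand out.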

We use an adjustable similarity threshold to tune the output so that it draws attention to the most similar documents. Using this approach, we were able to identify forged documents reliably and consistently, even if some relevant details (such as names or addresses) had been changed.
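A thresholding step along these lines could look like the sketch below. The default threshold of 0.8, the `flag_similar` function, and the sample documents are assumptions made for illustration; in practice the right threshold depends on the vectorization and on calibration against known fakes.

```python
import math
import re
from collections import Counter
from itertools import combinations


def vectorize(text):
    """Bag-of-words token counts (a simplifying assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norms = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norms if norms else 0.0


def flag_similar(documents, threshold=0.8):
    """Return (id_a, id_b, score) for each pair at or above the threshold,
    most similar first."""
    vectors = {doc_id: vectorize(text) for doc_id, text in documents.items()}
    flagged = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            flagged.append((id_a, id_b, score))
    return sorted(flagged, key=lambda pair: pair[2], reverse=True)


# Hypothetical extracted text keyed by document ID.
documents = {
    "bill_1": "Acme Electric bill for Jane Doe 12 Oak St amount due 84.10",
    "bill_1_forged": "Acme Electric bill for John Smith 12 Oak St amount due 84.10",
    "bill_2": "Metro Water bill for Maria Garcia 77 Pine Ave amount due 42.00",
}

flagged = flag_similar(documents, threshold=0.8)
```

Raising the threshold narrows the output to near-identical documents, while lowering it surfaces looser matches at the cost of more pairs for an analyst to review.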


Our model was extremely effective. In our tests, it flagged every modified document, a 100% detection rate, and produced no false positives. Because our results were consistent across a variety of document formats, we are confident the model applies to multiple use cases.
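For readers who want to see how such figures are computed, the sketch below derives a detection rate and a false positive rate from a set of flagged document IDs. The `evaluate` function and the toy IDs are hypothetical; they are not our test data.

```python
def evaluate(flagged, actual_forgeries, all_documents):
    """Return (detection_rate, false_positive_rate) for a set of flagged IDs.

    detection_rate: share of known forgeries that were flagged.
    false_positive_rate: share of legitimate documents that were flagged.
    """
    flagged = set(flagged)
    actual = set(actual_forgeries)
    legitimate = set(all_documents) - actual

    true_positives = flagged & actual
    false_positives = flagged & legitimate

    # By convention here, an empty denominator yields a perfect score.
    detection_rate = len(true_positives) / len(actual) if actual else 1.0
    false_positive_rate = (
        len(false_positives) / len(legitimate) if legitimate else 0.0
    )
    return detection_rate, false_positive_rate


# Toy example: two known forgeries among four documents, both flagged.
all_docs = ["a", "b", "c", "d"]
perfect = evaluate(flagged={"a", "b"}, actual_forgeries={"a", "b"},
                   all_documents=all_docs)
# A weaker run: one forgery missed, one legitimate document flagged.
imperfect = evaluate(flagged={"a", "c"}, actual_forgeries={"a", "b"},
                     all_documents=all_docs)
```

A result like the one described above corresponds to a detection rate of 1.0 and a false positive rate of 0.0.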

Figure: Cosine similarity by frequency, with clear indication of the target documents beyond the tool's configurable threshold

These results showed that our AI/ML solution can automate tedious review routines. That frees employees to spend more time on the documents the model flags as potentially duplicative or suspicious. Overall, our solution will help improve customer satisfaction, combat identity theft, and prevent waste, fraud, and abuse. Now that we’ve integrated it with our Data Science Toolkit, our teams can use it to benefit their clients.

We’re excited to see what additional use cases our teams find for it and the creative ways Excellians will apply AI/ML in the future.

You can learn more about our approach to AI Solutions here.