In my previous blog post, I introduced ML.NET, the open-source, cross-platform library for Machine Learning in .NET. In this post, we’ll further explore how we can use ML.NET, to implement a binary classifier to detect fraudulent transactions.
Terminology

Before we get started, let's take a minute to demystify some machine learning terminology that will be used throughout this post.
A feature is a specific column in a dataset. Some examples of features, as they relate to fraud detection, are the transferred amount, location, and the destination account. It’s not uncommon to have thousands of features against which to train your model.
There are two primary types of machine learning: supervised and unsupervised. Fraud detection is a supervised learning problem: we train our model on data that is already marked as fraudulent or not. This data source is called labeled data, and we train the model to predict a label for new data it has not yet been exposed to or trained against.
A commonly used metric to evaluate the performance of our model is accuracy. Accuracy will measure the percentage of correctly classified transactions in the test set.
When working with classifiers, precision and recall are two important metrics. If your dataset is highly unbalanced, e.g. only 1% of all transactions are fraudulent, a model that simply guesses non-fraudulent for everything will still achieve 99% accuracy. This is obviously not very helpful, so we use precision and recall to gain a better understanding of the quality of our model. Both precision and recall range between 0 and 1, with values closer to 1 preferred. Precision aims to reduce the number of false positives, while recall aims to reduce the number of false negatives (e.g. making sure we do not miss any transactions that were in fact fraudulent).
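To make the imbalance point concrete, here is a small sketch (the counts are made up for illustration): out of 1,000 transactions, 10 are fraudulent, and the model flags nothing as fraud.

```csharp
using System;

// 1,000 transactions, 10 fraudulent; the model predicts "not fraud" for all of them.
int truePositives = 0, falsePositives = 0;
int falseNegatives = 10, trueNegatives = 990;

double accuracy = (truePositives + trueNegatives)
    / (double)(truePositives + trueNegatives + falsePositives + falseNegatives);
double recall = truePositives / (double)(truePositives + falseNegatives);

Console.WriteLine(accuracy); // 0.99 — looks great on paper...
Console.WriteLine(recall);   // 0 — but every fraudulent transaction was missed.
```

Precision is not even defined here (zero predicted positives), which is another hint that accuracy alone tells us very little on unbalanced data.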
Building a machine learning model in .NET generally follows the same steps as in Python or R.
As we have already decided that we will be modeling fraud detection, our next step is to decide where to get the data. In a real-world scenario, this kind of data would be internal to the business, but for demonstration purposes we can use a dataset from the data competition platform Kaggle.
To get started, let’s create a .NET Core console application (download the .NET Core SDK if not already installed).
Open Visual Studio Code and in the terminal execute:
Navigate to the newly created solution:
Add the required NuGet packages.
To open the folder in Visual Studio Code, execute:
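The CLI steps above can be sketched together as follows (the project name FraudDetection is a placeholder, and the package list is an assumption based on what the rest of the post uses):

```shell
# Create a new .NET Core console application
dotnet new console -o FraudDetection

# Navigate to the newly created solution
cd FraudDetection

# Add the required NuGet packages
dotnet add package Microsoft.ML
dotnet add package Microsoft.ML.FastTree

# Open the folder in Visual Studio Code
code .
```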
In ML.NET, all operations such as data load, data transformations, and algorithms are located on the MLContext. In your Main method, add:
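A minimal sketch (the seed value is an assumption, added so runs are reproducible):

```csharp
using Microsoft.ML;

// MLContext is the entry point for all ML.NET operations;
// fixing the seed makes results reproducible across runs.
var mlContext = new MLContext(seed: 1);
```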
Data can be loaded from a file or database tables, among many other options. The schema is defined similarly to Entity Framework: we add a POCO with a property for each column in the dataset and decorate each property with a LoadColumn attribute that defines the property's position within the dataset. With that in place, we can load the data into memory. Please also add a variable named "DataPath" that defines the path to your dataset.
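A sketch of what this could look like. The column names and positions below are assumptions based on the Kaggle fraud dataset and may need adjusting to your copy of the data; the file name in DataPath is a placeholder.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// POCO describing the dataset schema; LoadColumn gives the
// zero-based position of each column in the CSV file.
public class Transaction
{
    [LoadColumn(1)] public string Type { get; set; }
    [LoadColumn(2)] public float Amount { get; set; }
    [LoadColumn(4)] public float OldBalanceOrig { get; set; }
    [LoadColumn(5)] public float NewBalanceOrig { get; set; }

    // The 0/1 fraud indicator, mapped to the default "Label" column.
    [LoadColumn(9), ColumnName("Label")] public bool IsFraud { get; set; }
}

// Path to the downloaded dataset (placeholder).
private static readonly string DataPath = "data.csv";

// In Main, after creating the MLContext:
IDataView data = mlContext.Data.LoadFromTextFile<Transaction>(
    DataPath, hasHeader: true, separatorChar: ',');
```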
How do we know how good our model is? We measure it, and we do that with about 20% of the dataset that we put aside prior to training. To split the dataset into a training and test dataset, add the following line to your Main method.
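Assuming the loaded data is in a variable named data, the split could look like this:

```csharp
// Hold out ~20% of the rows for evaluation; the rest is used for training.
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
IDataView trainData = split.TrainSet;
IDataView testData  = split.TestSet;
```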
For a machine learning model to understand the data it is given, the data needs to be transformed into numerical form. There are a couple of ways this can be done, but let's walk through one of the more common ones: OneHotEncoding. OneHotEncoding looks at the values in a given column and creates a binary column, holding a 0 or a 1, for each distinct value. In our dataset, we have a column named "type", with values such as payment and transfer. OneHotEncoding will create additional columns named type_payment and type_transfer, in which a transaction of type payment will have a 1 in the type_payment column and a 0 in the type_transfer column. In addition to transforming the type column, we also want to select which columns (or features) in the dataset should be used during training. All of these operations can be found on the MLContext and chained together into a data processing pipeline.
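A sketch of such a pipeline; the feature column names are assumptions matching the schema used earlier, and the variable names are placeholders:

```csharp
// One-hot encode the categorical "Type" column, then gather the
// columns we want to train on into a single "Features" vector.
var dataProcessPipeline = mlContext.Transforms.Categorical
        .OneHotEncoding(outputColumnName: "TypeEncoded", inputColumnName: "Type")
    .Append(mlContext.Transforms.Concatenate("Features",
        "TypeEncoded", "Amount", "OldBalanceOrig", "NewBalanceOrig"));
```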
It’s easy to get lost when selecting an algorithm. I always try to encourage people not to be afraid of trial and error. After all, machine learning is more of an empirical than theoretical science. A good starting point is Microsoft Docs.
For our example, we are working with a highly unbalanced dataset. Decision trees are generally good at handling this kind of scenario. We will specifically look at an ensemble of multiple decision trees all pooled together to create a final prediction. To append a trainer and train our model, add the following lines to your Main method:
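One reasonable choice is FastTree, a boosted ensemble of decision trees (it lives in the Microsoft.ML.FastTree NuGet package). This sketch assumes the data processing pipeline from the previous step is stored in a variable named dataProcessPipeline and the training split in trainData:

```csharp
// Append a boosted decision tree ensemble to the data processing pipeline.
var trainingPipeline = dataProcessPipeline.Append(
    mlContext.BinaryClassification.Trainers.FastTree(
        labelColumnName: "Label", featureColumnName: "Features"));

// Train the model on the training portion of the dataset.
ITransformer model = trainingPipeline.Fit(trainData);
```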
Evaluating our model is a two-step process:
To transform our test data using the trained model, simply call the method “Transform” on the trained model, passing in the test dataset as an argument.
To calculate the metrics for our model, use the BinaryClassification evaluator on the MLContext.
The metric variable will contain values for metrics such as accuracy, precision, and recall.
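Both steps together could look like this, assuming the trained model is in a variable named model and the held-out data in testData:

```csharp
// Step 1: score the held-out test set with the trained model.
IDataView predictions = model.Transform(testData);

// Step 2: compute binary classification metrics from the scored data.
var metrics = mlContext.BinaryClassification.Evaluate(
    predictions, labelColumnName: "Label");

Console.WriteLine($"Accuracy:  {metrics.Accuracy:P2}");
Console.WriteLine($"Precision: {metrics.PositivePrecision:P2}");
Console.WriteLine($"Recall:    {metrics.PositiveRecall:P2}");
```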
Models in ML.NET are saved as a .zip file. To save our trained model for later use, add the following line to your Main method:
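A sketch, assuming the trained model and training data from the earlier steps (the output file name is a placeholder):

```csharp
// Persist the trained model, together with its input schema, as a .zip file.
mlContext.Model.Save(model, trainData.Schema, "FraudModel.zip");
```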
By now, hitting F5 should successfully run your console application and produce a custom-trained machine learning model. Congratulations!
In this post, we've taken a good look at how you can create your own fraud detection classifier in ML.NET with just under 50 lines of code. The finished model is saved as a .zip file and can be integrated into an ASP.NET Core application or an Azure Function. In a follow-up post, we'll explore these various deployment options for our new model. Thank you for reading!