In my previous blog post, I introduced ML.NET, the open-source, cross-platform library for Machine Learning in .NET. In this post, we’ll further explore how we can use ML.NET, to implement a binary classifier to detect fraudulent transactions.
Terminology

Before we get started, let's take a minute to demystify some machine learning terminology that will be used throughout this post.
A feature is a specific column in a dataset. Some examples of features, as they relate to fraud detection, are the transferred amount, location, and the destination account. It’s not uncommon to have thousands of features against which to train your model.
There are two primary types of machine learning: supervised and unsupervised. Fraud detection is a supervised learning problem: we train our model on data that is already marked as fraudulent or not. This data source is called labeled data, and we train the model to predict a label for new data it has not yet been exposed to or trained against.
A commonly used metric to evaluate the performance of our model is accuracy. Accuracy will measure the percentage of correctly classified transactions in the test set.
When working with classifiers, precision and recall are two important metrics. If your dataset is highly unbalanced, e.g. only 1% of all transactions are fraudulent, a model that simply guesses non-fraudulent for everything will still achieve 99% accuracy. This is obviously not very helpful, so we use precision and recall to gain a better understanding of the quality of our model. Both precision and recall range between 0 and 1, with values closer to 1 preferred. Precision aims to reduce the number of false positives, while recall aims to reduce the number of false negatives (e.g. making sure we do not miss any transactions that were in fact fraudulent).
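To make the imbalance point concrete, here is a small sketch (the counts are made up for illustration): out of 1,000 transactions, 10 are fraudulent, and the model flags nothing as fraud.

```csharp
using System;

// 1,000 transactions, 10 fraudulent; the model predicts "not fraud" for all of them.
int truePositives = 0, falsePositives = 0;
int falseNegatives = 10, trueNegatives = 990;

double accuracy = (truePositives + trueNegatives)
    / (double)(truePositives + trueNegatives + falsePositives + falseNegatives);
double recall = truePositives / (double)(truePositives + falseNegatives);

Console.WriteLine(accuracy); // 0.99 — looks great on paper...
Console.WriteLine(recall);   // 0 — but every fraudulent transaction was missed.
```

Precision is not even defined here (zero predicted positives), which is another hint that accuracy alone tells us very little on unbalanced data.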
Building a machine learning model in .NET generally follows the same steps as in Python or R.
As we have already decided that we will be modeling fraud detection, our next step is to decide where to get the data. In a real-world scenario, this kind of data would be internal to the business, but for demonstration purposes we can use a dataset from the data competition platform Kaggle.
To get started, let’s create a .NET Core console application (download the .NET Core SDK if not already installed).
Open Visual Studio Code and in the terminal execute:
Navigate to the newly created solution:
Add the required NuGet packages.
To open the folder in Visual Studio Code, execute:
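The CLI steps above can be sketched together as follows (the project name FraudDetection is a placeholder, and the package list is an assumption based on what the rest of the post uses):

```shell
# Create a new .NET Core console application
dotnet new console -o FraudDetection

# Navigate to the newly created solution
cd FraudDetection

# Add the required NuGet packages
dotnet add package Microsoft.ML
dotnet add package Microsoft.ML.FastTree

# Open the folder in Visual Studio Code
code .
```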
In ML.NET, all operations such as data load, data transformations, and algorithms are located on the MLContext. In your Main method, add:
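A minimal sketch (the seed value is an assumption, added so runs are reproducible):

```csharp
using Microsoft.ML;

// MLContext is the entry point for all ML.NET operations;
// fixing the seed makes results reproducible across runs.
var mlContext = new MLContext(seed: 1);
```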
Data can be loaded from a file or database tables, among many other options. The schema is defined similarly to Entity Framework: we add a POCO with a property for each column in the dataset and decorate each property with a LoadColumn attribute that defines the property's position within the dataset. With that in place, we can load the data into memory. Please also add a variable named "DataPath" that defines the path to your dataset.
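A sketch of what this could look like. The column names and positions below are assumptions based on the Kaggle fraud dataset and may need adjusting to your copy of the data; the file name in DataPath is a placeholder.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// POCO describing the dataset schema; LoadColumn gives the
// zero-based position of each column in the CSV file.
public class Transaction
{
    [LoadColumn(1)] public string Type { get; set; }
    [LoadColumn(2)] public float Amount { get; set; }
    [LoadColumn(4)] public float OldBalanceOrig { get; set; }
    [LoadColumn(5)] public float NewBalanceOrig { get; set; }

    // The 0/1 fraud indicator, mapped to the default "Label" column.
    [LoadColumn(9), ColumnName("Label")] public bool IsFraud { get; set; }
}

// Path to the downloaded dataset (placeholder).
private static readonly string DataPath = "data.csv";

// In Main, after creating the MLContext:
IDataView data = mlContext.Data.LoadFromTextFile<Transaction>(
    DataPath, hasHeader: true, separatorChar: ',');
```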
How do we know how good our model is? We measure it, and we do that with about 20% of the dataset that we put aside prior to training. To split the dataset into a training and test dataset, add the following line to your Main method.
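Assuming the loaded data is in a variable named data, the split could look like this:

```csharp
// Hold out ~20% of the rows for evaluation; the rest is used for training.
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
IDataView trainData = split.TrainSet;
IDataView testData  = split.TestSet;
```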
For a machine learning model to understand the data it is given, the data needs to be transformed into numerical form. There are a couple of ways this can be done, but let's walk through one of the more common ones: OneHotEncoding. OneHotEncoding looks at the values in a given column and creates a binary column, holding a 0 or a 1, for each distinct value. In our dataset, we have a column named "type", with values such as payment and transfer. OneHotEncoding will create additional columns named type_payment and type_transfer, in which a transaction of type payment will have a 1 in the type_payment column and a 0 in the type_transfer column. In addition to transforming the type column, we also want to select which columns (or features) in the dataset should be used during training. All of these operations can be found on the MLContext and chained together into a data processing pipeline.
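A sketch of such a pipeline; the feature column names are assumptions matching the schema used earlier, and the variable names are placeholders:

```csharp
// One-hot encode the categorical "Type" column, then gather the
// columns we want to train on into a single "Features" vector.
var dataProcessPipeline = mlContext.Transforms.Categorical
        .OneHotEncoding(outputColumnName: "TypeEncoded", inputColumnName: "Type")
    .Append(mlContext.Transforms.Concatenate("Features",
        "TypeEncoded", "Amount", "OldBalanceOrig", "NewBalanceOrig"));
```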
It’s easy to get lost when selecting an algorithm. I always try to encourage people not to be afraid of trial and error. After all, machine learning is more of an empirical than theoretical science. A good starting point is Microsoft Docs.
For our example, we are working with a highly unbalanced dataset. Decision trees are generally good at handling this kind of scenario. We will specifically look at an ensemble of multiple decision trees all pooled together to create a final prediction. To append a trainer and train our model, add the following lines to your Main method:
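One reasonable choice is FastTree, a boosted ensemble of decision trees (it lives in the Microsoft.ML.FastTree NuGet package). This sketch assumes the data processing pipeline from the previous step is stored in a variable named dataProcessPipeline and the training split in trainData:

```csharp
// Append a boosted decision tree ensemble to the data processing pipeline.
var trainingPipeline = dataProcessPipeline.Append(
    mlContext.BinaryClassification.Trainers.FastTree(
        labelColumnName: "Label", featureColumnName: "Features"));

// Train the model on the training portion of the dataset.
ITransformer model = trainingPipeline.Fit(trainData);
```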
Evaluating our model is a two-step process:
To transform our test data using the trained model, simply call the method “Transform” on the trained model, passing in the test dataset as an argument.
To calculate the metrics for our model, use the BinaryClassification evaluator on the MLContext.
The metric variable will contain values for metrics such as accuracy, precision, and recall.
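Both steps together could look like this, assuming the trained model is in a variable named model and the held-out data in testData:

```csharp
// Step 1: score the held-out test set with the trained model.
IDataView predictions = model.Transform(testData);

// Step 2: compute binary classification metrics from the scored data.
var metrics = mlContext.BinaryClassification.Evaluate(
    predictions, labelColumnName: "Label");

Console.WriteLine($"Accuracy:  {metrics.Accuracy:P2}");
Console.WriteLine($"Precision: {metrics.PositivePrecision:P2}");
Console.WriteLine($"Recall:    {metrics.PositiveRecall:P2}");
```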
Models in ML.NET are saved as a .zip file. To save our trained model for later use, add the following line to your Main method:
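A sketch, assuming the trained model and training data from the earlier steps (the output file name is a placeholder):

```csharp
// Persist the trained model, together with its input schema, as a .zip file.
mlContext.Model.Save(model, trainData.Schema, "FraudModel.zip");
```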
By now, hitting F5 should successfully run your console application and produce a custom-trained machine learning model. Congratulations!
In this post, we've taken a good look at how you can create your own fraud detection classifier in ML.NET with just under 50 lines of code. The finished model is saved as a .zip file and can be integrated into an ASP.NET Core application or an Azure Function. In a follow-up post, we'll explore these various deployment options for our new model. Thank you for reading!