When trying to predict and model fraudulent e-commerce behavior, a variety of variables need to be taken into account: the billing address location compared to the shipping location, the amount being charged to the card, whether or not the purchase has been verified by other means, and more are all useful for predicting whether a given purchase is fraudulent. However, this also raises the problem of keeping such sensitive, individual data safe. Fields such as an individual’s name, email, or billing information could be used by malicious third parties to commit further fraud, endangering the private information of e-commerce clients.
In an effort to combat this, the Payment Card Industry Data Security Standard (PCI DSS) was created to ensure that minimum levels of security are met in the storing, processing, and transmission of cardholder data. Among these standards are twelve requirements for building and maintaining a secure network and a safe system, including guidelines for protecting cardholder data by using anti-virus software, restricting access to confidential data, and encrypting that data.¹ In this paper, we will focus on exploring the last of these, including the problems inherent in removing or coarsening data that may be essential to a machine learning algorithm for fraud detection, and how to combat these issues using statistical techniques, namely principal component analysis.
Data Removal and Encryption
One of the more common “solutions” to the problem of sensitive data is to simply remove all sensitive fields. But while secure, this method can result in the removal of data useful or even essential to the machine learning algorithm. For instance, an individual’s billing address certainly constitutes sensitive data that could be used to identify them, but if this field is completely removed, an ML algorithm would lose a great deal of valuable information, including that individual’s location, the ability to compare billing location with shipping location, and so on, weakening the algorithm’s power and effectiveness. Another common alternative is to encrypt the data before sending it to the machine learning engineers. However, encrypted fields typically must be decrypted before they can be used for training, and simple or deterministic encryption schemes can be broken through frequency-analysis attacks, leaving sensitive data vulnerable to bad actors.
A form of “middle ground” solution to the problems posed by data removal and data encryption can be found in data “coarsening.” This entails the rounding off of data to a lower rate of precision, such that the essential information is still sent to machine learning engineers for use in training the algorithm, but additional details and individual information are withheld. In other words, the data goes from being specific to each individual to being grouped into larger “buckets” for higher security.
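As a minimal sketch of this bucketing idea (the field names and bucket widths here are illustrative, not drawn from any real dataset):

```python
# Sketch of data coarsening: round sensitive numeric fields down into
# buckets, so individual values are replaced by group membership.

def coarsen(value, bucket_width):
    """Round a value down to the start of its bucket."""
    return (value // bucket_width) * bucket_width

# An exact age of 37 becomes the 30-39 bucket; an exact purchase
# amount of $142.75 becomes the $100-199 bucket.
print(coarsen(37, 10))       # -> 30
print(coarsen(142.75, 100))  # -> 100.0
```

The bucket width controls the precision/privacy trade-off: wider buckets hide more detail from third parties but also from the model.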
For example, IP addresses are often used in machine learning algorithms and can be treated like physical addresses in terms of sensitivity. One coarsening technique would be to zero out the last eight bits (the final octet) of an IPv4 address, which would have the same function as replacing the latitude and longitude of a user with the city that the user is in. In other words, this coarsening lowers the precision with which an individual’s geographical location is measured in exchange for protection.
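A sketch of this IPv4 technique, assuming dotted-quad string input:

```python
# Sketch: coarsen an IPv4 address by zeroing its final octet (the last
# eight bits), analogous to replacing exact coordinates with a city.

def coarsen_ipv4(ip: str) -> str:
    octets = ip.split(".")
    octets[-1] = "0"          # drop the host portion, keep the /24 network
    return ".".join(octets)

print(coarsen_ipv4("203.0.113.42"))  # -> "203.0.113.0"
```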
Of course, this method has the same issues as data removal, though to a lesser extent: it is possible that essential information is being obscured from the machine learning algorithm. However, it is still a major improvement over the complete removal of data and is more secure than simple encryption.
The final method we will consider here is data masking: the alteration of certain fields such that the most essential information is provided to the machine learning algorithm, but in such a way that individual information remains obscured or hidden from third parties. This often entails a transformation of the data collected and the creation of new fields for processing in the algorithm.
For example, let us assume that an individual’s billing address and shipping address are collected at the time of purchase. A machine-learning algorithm might take into account the geographical distance between these two locations: if a credit card’s billing address is in the United States, but the purchase is being shipped to Russia, this may indicate a heightened possibility of fraudulent behavior. In this case, the variable of interest is not intrinsic to the variables of billing address and shipping address individually, but rather to the relationship between the two (e.g. the distance between them). As such, instead of sending these addresses, which constitute sensitive data, to the machine learning algorithm, the numeric distance could be calculated ahead of time, and that number could be sent instead. The algorithm would still have access to the data of interest, while malicious third parties would have a number with no inherent meaning. These sorts of simple transformations can help mask sensitive data while still providing the machine learning algorithm with essential information.
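The distance transformation above can be sketched as follows, assuming the two addresses have already been geocoded to latitude/longitude; the coordinates and the haversine formula below are illustrative stand-ins for whatever geocoding and distance measure a real pipeline would use:

```python
import math

# Sketch of distance masking: the raw billing and shipping addresses
# stay with the data owner; only the derived distance (km) is sent on.

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

billing = (40.71, -74.01)   # e.g. a billing address in New York
shipping = (55.76, 37.62)   # e.g. a shipping address in Moscow
masked_feature = haversine_km(*billing, *shipping)
print(round(masked_feature))  # a single number replaces both addresses
```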
Principal Component Analysis (PCA)
Another slightly more complex example of data masking can be performed using Principal Component Analysis, or PCA. PCA was developed as a means of reducing the number of variables that need to be taken into account when analyzing data. It accomplishes this by creating new “principal components” from linear transformations of the original variables.² For instance, if the variables collected during a transaction include the amount of money spent, the distance between billing address and shipping address, and the number of items purchased, a principal component might be calculated as a weighted sum of those variables:

PC₁ = w₁ · (amount spent) + w₂ · (billing-shipping distance) + w₃ · (number of items)

where the weights are the component’s loadings, determined by PCA as an eigenvector of the data’s covariance matrix.
Each observation would have a value for that principal component, which explains a certain amount of variation in the data and is thus useful to the machine learning algorithm. If those numeric values are sent without an explanation of how they were calculated, the algorithm could still be trained on them and make accurate predictions, but malicious actors would not know how to interpret the masked values. Even if one were to identify individual observations that were unique in some way, it would still be difficult to determine, without the PCA weight vectors, what exactly makes them unique. However, it is important to note that PCA transforms the distribution of the data, and dropping lower-variance components discards information, so masking can still trade some accuracy for security.
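As a small illustration of how a single masked value is produced, assuming a component’s weight vector has already been computed (the weights and feature values below are hypothetical):

```python
import numpy as np

# Sketch: once a principal-component direction (a weight vector) has
# been chosen, each observation's masked value is just the dot product
# of its feature values with those weights.

weights = np.array([0.62, 0.71, 0.33])  # hypothetical loadings
obs = np.array([120.0, 5400.0, 3.0])    # amount, distance, item count

pc_value = obs @ weights
print(pc_value)  # one number, meaningless to a third party without the weights
```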
Let’s take a closer look at this PCA method with a concrete example using the Heart Disease UCI dataset from Kaggle. In this example, we pass two separate datasets to a machine learning algorithm: one of the datasets will be the data as it was found on Kaggle, while the other will contain data masked through PCA.
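The original walkthrough applies PCA to the Kaggle data in code not reproduced here; the sketch below shows the same masking step on a synthetic stand-in matrix, using a covariance eigendecomposition:

```python
import numpy as np

# Sketch of the masking step. A small synthetic matrix stands in for
# the Heart Disease UCI data so the example is self-contained. Only
# the predictor columns are transformed; the target is left untouched.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))           # 13 predictors, like the UCI data
y = (rng.random(100) > 0.5).astype(int)  # binary target (stand-in)

def pca_mask(X):
    """Project centered data onto the eigenvectors of its covariance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]    # largest-variance component first
    return Xc @ eigvecs[:, order]

X_masked = pca_mask(X)
print(X_masked.shape)  # same number of components as original columns
```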
Note that we only perform PCA on the parameters and not the target value, which we aim to predict.
Now, we have two datasets: data, the original dataset, and data_masked, the data masked through PCA. Let’s take a brief look at them:
The regular dataset gives clear, easy-to-understand values for parameters such as age, sex, cholesterol, etc.
It’s worth noting here that the values for each parameter vary quite a bit, with age ranging from 29 to 77, while parameters like chol range from 126 to 564. We will perform z-score standardization to ensure that these values are normalized for the model (not normalizing the target value, which we hope to predict).
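A sketch of the standardization step, with illustrative values for two of the columns:

```python
import numpy as np

# Sketch of z-score standardization: each predictor column is rescaled
# to mean 0 and standard deviation 1. Column values are illustrative
# (e.g. an age column and a chol column).

X = np.array([[29.0, 126.0],
              [50.0, 240.0],
              [77.0, 564.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # ~0 for every column
print(X_std.std(axis=0))   # 1 for every column
```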
Now the ranges of all of the parameters are adjusted, so variables with larger ranges will not be over-emphasized by the model. Note that this step is not repeated for the masked data, as the variables are rescaled as part of the PCA transformation.
Next, let’s take a look at the dataset masked using PCA.
The dataset masked with PCA still provides us with 13 principal components, but their values are far more difficult to interpret — what does a PC1 value of -0.623 indicate? Without the formula used to calculate it, we would have difficulty determining the information behind the number.
Now, let’s train two separate machine learning algorithms on these data sets to see what is sacrificed in terms of accuracy for increased security. Let’s start with the normal (unmasked) dataset.
We begin by splitting the data into test and training sets.
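A minimal sketch of such a split (roughly 80/20), assuming features X and target y are already loaded; synthetic arrays stand in here:

```python
import numpy as np

# Sketch of a random train/test split. Shuffling the row indices and
# cutting at 80% gives disjoint training and test sets.

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 13))           # stand-in features
y = (rng.random(100) > 0.5).astype(int)  # stand-in target

idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(len(X_train), len(X_test))  # -> 80 20
```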
Next, we will perform classification using a generalized linear model (via glmnet). We will measure the model’s accuracy using the AUC (Area Under the ROC Curve) score, which represents the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one. An accurate model has an AUC score close to 1.0.
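The walkthrough fits a glmnet model in R; as a self-contained stand-in, the sketch below trains a plain logistic-regression classifier by gradient descent on synthetic data and scores it with a rank-based AUC:

```python
import numpy as np

# Stand-in for the glmnet step: a plain logistic regression fit by
# gradient descent, scored with a pairwise (rank-based) AUC. The data
# are synthetic, with class 1 shifted so there is signal to find.

rng = np.random.default_rng(1)
n = 200
y = (rng.random(n) > 0.5).astype(int)
X = rng.normal(size=(n, 5)) + y[:, None] * 1.5  # informative features

def fit_logistic(X, y, lr=0.1, steps=500):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))     # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)       # gradient step on weights
        b -= lr * (p - y).mean()               # gradient step on bias
    return w, b

def auc_score(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    return (pos[:, None] > neg[None, :]).mean()

w, b = fit_logistic(X, y)
auc = auc_score(y, X @ w + b)
print(round(auc, 3))  # close to 1.0 on this easy synthetic data
```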
This is a reasonably high AUC score, meaning this model trained on the actual (unmasked) data does very well.
Now, let’s try training a new model on the masked data.
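The same stand-in pipeline, now trained on PCA-masked features; since the masking is a rotation of the centered feature space, a linear model can find essentially the same decision boundary:

```python
import numpy as np

# Stand-in pipeline on masked data: the classifier sees only the
# PCA-projected features, never the originals.

rng = np.random.default_rng(1)
n = 200
y = (rng.random(n) > 0.5).astype(int)
X = rng.normal(size=(n, 5)) + y[:, None] * 1.5

# PCA masking: project centered data onto covariance eigenvectors
Xc = X - X.mean(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
X_masked = Xc @ eigvecs

def fit_logistic(X, y, lr=0.1, steps=500):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def auc_score(y_true, scores):
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    return (pos[:, None] > neg[None, :]).mean()

w, b = fit_logistic(X_masked, y)
auc_masked = auc_score(y, X_masked @ w + b)
print(round(auc_masked, 3))  # comparable to the unmasked score
```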
Even though our data is masked, we still maintain a very high AUC score, in this case even higher than the score for the model trained on the unmasked data. In other words, no accuracy was sacrificed for security.
It is worth noting that this is a highly simplified example — the data went through almost no pre-processing, nor did we spend any time fine-tuning either model. It is certainly possible that the model trained on the unmasked data could have been better tuned than the model trained on the masked data, or vice versa. This is only meant to illustrate that the loss of accuracy in exchange for security can be trivial or even non-existent, thus allowing us to use PCA masking to keep sensitive data secure without harming the performance of our model.