Multiple approaches to dimensionality reduction for pattern discovery, visualization and drawing decision boundaries
Background: The case for dimensionality reduction
My unit and I were recently assigned to a client project. The business executives on the client team represent one of the world’s largest publishers (the “supply” side of the advertising industry; the opposite “demand” side being advertisers) and programmatic ad ecosystems.
After a quick discovery call and sign-off, we embarked on the project. The client sent us a sample of device-level data containing roughly 240 columns (variables), and we scoped the first phase of the engagement to focus on extracting patterns or features that are indicative of the gender of the device owner.
In other words, we would be extracting predictive features, or determining decision boundaries, that separate male from female device owners using the raw device-level data collected by the client team (henceforth the “research objective”).
While a separate research effort independently treated this problem as a purely supervised classification task, our team used dimensionality reduction, a procedure that transforms the data from a high-dimensional space into a lower-dimensional one while retaining meaningful properties and correlations in the data.
Wikipedia says this about dimensionality reduction:
“Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses.”
In our case, not all 240+ variables are significant, or even meaningful, predictors of gender; many of them may be noise rather than signal.
We used unsupervised machine learning methods such as Principal Component Analysis (PCA) and Factor Analysis of Mixed Data (FAMD) to perform the dimensionality reduction.
Principal Component Analysis (PCA)
The purpose of PCA is to find a lower-dimensional set of axes (the principal components) that summarizes the data by capturing as much of its variance as possible. The working assumption is that directions with high variance carry more information than directions with low variance, although high variance alone does not guarantee a good predictor. As an analogy, think about a cube, a tube, and a triangular prism: they have 6, 3, and 5 sides respectively, yet all of them can be described in 3 dimensions. In the same way, a dataset with many variables can often be described with far fewer dimensions. Dimensionality reduction procedures like PCA and FAMD aim to discard the low-variance directions and remove the redundancy among correlated variables. They look at the variance in the data and construct new dimensions, each of which is a combination of the original variables.
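To make the idea of redundancy concrete, here is a minimal sketch on simulated data (not the client data). The variables x1, x2 and x3 are made up for illustration: x2 is almost a copy of x1, so we would expect two components to carry nearly all of the variance.

```r
# Toy illustration only: three simulated variables, two of them nearly redundant.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # almost the same information as x1
x3 <- rnorm(n)                 # independent of the other two
toy <- data.frame(x1, x2, x3)

# Centre and scale each variable, then extract the principal components.
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each component.
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(var_explained, 3))

# Loadings: how each original variable contributes to each new axis.
print(round(toy_pca$rotation, 2))
```

The last component should explain close to nothing, which is the signal that three columns can be summarized by two axes without losing much.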
The Dimensionality Reduction Process
Our team uses the R programming language, and example code will be provided in the sections below.
Our client’s data are stored in S3 buckets, so we start by fetching the data (or connecting directly to the buckets) and then run a series of data cleansing steps: removing variables that are not meaningful, removing outliers caused by data entry errors, and standardizing the numeric variables so they are on a consistent scale.
sample1_num_scale <- data.frame(scale(sample1_num))
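As a quick sanity check of what scale() does, the sketch below standardizes a tiny made-up data frame. The column names session_seconds and screen_width_px are hypothetical and do not come from the client data.

```r
# Hypothetical numeric columns on very different scales.
toy_num <- data.frame(
  session_seconds = c(12, 340, 45, 900, 60),
  screen_width_px = c(360, 1080, 720, 1440, 360)
)

# scale() centres each column to mean 0 and divides by its standard deviation.
toy_num_scale <- data.frame(scale(toy_num))

# After scaling, every column should have mean ~0 and standard deviation 1.
print(round(colMeans(toy_num_scale), 10))
print(sapply(toy_num_scale, sd))
```

Without this step, a variable measured in large units would dominate the variance and therefore the principal components.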
We then convert the character variables to the factor class and collapse the sparse levels into a single level to avoid having a large number of factor levels (an alternative is to use the fct_lump() function from the forcats package, which is part of the tidyverse, for this transformation):
# Assumes source_ad_format is still a character column (lump first, then convert to factor)
`%notin%` <- Negate(`%in%`)
sample1$source_ad_format[sample1$source_ad_format %notin% c("universal", "html", "vast")] <- "OTHER"
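The list of levels to keep is hard-coded above. A data-driven variant, sketched here in base R on a made-up vector, keeps any level that appears at least min_count times and lumps the rest. The values "native" and "audio" and the threshold min_count are assumptions for illustration only.

```r
# Hypothetical ad formats; "native" and "audio" each occur only once.
ad_format <- c("universal", "html", "vast", "html", "native",
               "universal", "audio", "html", "vast", "universal")

min_count <- 2                        # assumed threshold for a "sparse" level
counts <- table(ad_format)
keep   <- names(counts)[counts >= min_count]

# Replace every level that is too rare with a single catch-all level.
lumped <- factor(ifelse(ad_format %in% keep, ad_format, "OTHER"))
print(table(lumped))
```

This plays the same role as the hard-coded version; the threshold would need tuning against the real level frequencies.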
We proceed to make dummy variables for our factors and, as the final step in our data preparation, remove variables that are too homogeneous using the nearZeroVar() function from the caret package.
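The two steps can be sketched in base R as follows. This is not caret's algorithm: nearZeroVar() relies on a frequency ratio and a percentage of unique values, whereas the stand-in below simply drops any column whose most common value covers at least 95% of the rows. The toy data frame, the is_tablet column and the 0.95 cut-off are assumptions.

```r
# Hypothetical prepared data: one factor, one constant column, one numeric column.
toy <- data.frame(
  source_ad_format = factor(c("html", "vast", "html", "OTHER", "vast", "html")),
  is_tablet        = c(0, 0, 0, 0, 0, 0),
  x                = c(0.2, -1.1, 0.5, 1.3, -0.4, 0.0)
)

# Dummy (one-hot) encoding: "- 1" drops the intercept so every level gets a column.
X <- model.matrix(~ . - 1, data = toy)

# Simplified homogeneity check: share of rows taken by the most common value.
dominant_share <- apply(X, 2, function(col) max(table(col)) / length(col))
X_kept <- X[, dominant_share < 0.95, drop = FALSE]

print(colnames(X_kept))
```

Here the constant is_tablet column is dropped while the dummy columns and x survive.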
Performing PCA in R
The prcomp() function in base R (the stats package) is used to produce our principal components from the data. An alternative to prcomp() is the princomp() function: the latter uses a spectral (eigen) decomposition of the covariance matrix, while the former uses singular value decomposition (SVD). SVD is generally the more commonly used approach and, according to the R documentation, is also preferred for numerical accuracy.
pca <- prcomp(sample1)
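To see how the two decompositions relate, the sketch below computes the component standard deviations three ways on simulated data (the matrix X is made up): via SVD of the centred data, via an eigendecomposition of the covariance matrix, and via prcomp(). It also shows that princomp() differs only by a constant factor, because it divides by n rather than n - 1.

```r
set.seed(1)
X <- matrix(rnorm(100 * 4), ncol = 4)
X[, 2] <- X[, 1] + 0.5 * X[, 2]   # introduce some correlation
n <- nrow(X)

# Route 1: singular value decomposition of the centred data matrix.
Xc <- scale(X, center = TRUE, scale = FALSE)
sdev_svd <- svd(Xc)$d / sqrt(n - 1)

# Route 2: spectral (eigen) decomposition of the covariance matrix.
sdev_eigen <- sqrt(eigen(cov(X), symmetric = TRUE)$values)

# Built-in functions.
sdev_prcomp   <- prcomp(X)$sdev
sdev_princomp <- unname(princomp(X)$sdev)   # uses divisor n, not n - 1

print(all.equal(sdev_svd, sdev_prcomp))
print(all.equal(sdev_eigen, sdev_prcomp))
print(all.equal(sdev_princomp, sdev_prcomp * sqrt((n - 1) / n)))
```

On well-behaved data like this the routes agree to numerical tolerance; the accuracy advantage of SVD matters mostly for ill-conditioned data.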
To help in the exploratory process, we can add visual elements to our report. An example is the scree plot, which visualizes the percentage of variance explained by each principal component (fviz_eig() comes from the factoextra package):
fviz_eig(pca, addlabels = TRUE, ylim = c(0, 30))
Another example is the contribution plot, which visualizes the percentage contribution of each original variable to a given principal component (here the first component, limited to the top 10 variables):
fviz_contrib(pca, choice = "var", axes = 1, top = 10)
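The numbers behind both plots can be computed directly from a prcomp object. The sketch below uses R's built-in USArrests dataset rather than the client data. The contribution formula (squared loading times 100) is our understanding of the quantity shown in a contribution plot for PCA, so treat it as an assumption rather than a statement about factoextra's internals.

```r
# Built-in example data with four numeric variables.
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Scree plot values: percentage of variance explained per component.
pct_var <- 100 * pca_demo$sdev^2 / sum(pca_demo$sdev^2)
names(pct_var) <- colnames(pca_demo$x)
print(round(pct_var, 1))

# Contribution of each original variable to PC1, in percent.
# Each loading vector has unit length, so the squared loadings sum to 1.
contrib_pc1 <- 100 * pca_demo$rotation[, 1]^2
print(round(sort(contrib_pc1, decreasing = TRUE), 1))
```

Both vectors sum to 100, which is a handy check when reading the plots.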
We can also visualize how our data is projected onto the first two principal components (a score plot, loosely referred to here as a “biplot”) using the popular ggplot2 visualization library:
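library(ggplot2)
df_pca <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], PC3 = pca$x[, 3], gender = gender)
ggplot(df_pca, aes(x = PC1, y = PC2, color = gender)) + geom_point() +  # geom_point() is assumed; the original call was cut off
  labs(x = "PC 1", y = "PC 2")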
This yields the following plot: