
Decision boundaries with PCA and FAMD

Using dimensionality reduction for gender classification


Gerald Bryan, Unit 5

Multiple approaches to dimensionality reduction for pattern discovery, visualization and drawing decision boundaries

Background: The case for dimensionality reduction

My unit and I were recently assigned to a client project. The client is one of the world’s largest publishers (the “supply” side of the advertising industry; the opposite “demand” side being advertisers) and one of the largest programmatic ad ecosystems in the world.

After a quick discovery call and sign-off, we embarked on the project. The client sent us a sample of device-level data containing around 240 columns (variables), and we scoped the first phase of the engagement to focus on extracting patterns or features indicative of the gender of the device owner.

In other words, we would extract predictive features or determine decision boundaries that separate male from female device owners using the raw device-level data collected by the client team (henceforth referred to as the “research objective”).

While a separate research effort was independently conducted to treat this problem in a purely supervised, classification fashion, our team used dimensionality reduction, a procedure that transforms the data from a high-dimensional space into a lower-dimensional one while retaining meaningful properties and correlations in the data.

Wikipedia says this about dimensionality reduction:

Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses.

In our case, not all 240+ variables are significant, or even meaningful, predictors of gender – many of them may be noise rather than signal.

We used unsupervised machine learning methods, namely Principal Component Analysis (PCA) and Factor Analysis of Mixed Data (FAMD), to perform the dimensionality reduction.

Principal Component Analysis (PCA)

The purpose of PCA is to find a lower-dimensional set of axes, or variables, that summarizes the data using each variable’s variance. The intuition is that variables or features with high variance are more likely to be good predictors than those with low variance. As a loose analogy, think of a cube, a tube, and a prism: all of them live in 3 dimensions, yet they have different numbers of faces – 6 for a cube, 3 for a tube, and 5 for a (triangular) prism – so objects in the same dimensional space can be described by very different numbers of characteristics. Dimensionality reduction methods like PCA and FAMD aim to drop the directions with low variance and remove redundancy among the variables: they look at the variance of each variable and construct new dimensions that are combinations of the old variables.
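
To make the intuition concrete, here is a minimal sketch on a made-up toy dataset (not the client data) of how PCA summarizes two correlated variables with a single new axis:

# Toy illustration (not the client data): two correlated variables
set.seed(1)
x1 <- rnorm(200)
x2 <- 0.9 * x1 + rnorm(200, sd = 0.3)   # x2 mostly duplicates x1
toy <- data.frame(x1, x2)

toy_pca <- prcomp(toy, scale. = TRUE)
summary(toy_pca)   # PC1 captures most of the variance; PC2 is largely redundant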

The Dimensionality Reduction Process

Our team uses the R programming language, and example code will be provided in the sections below.

Our client’s data are stored in S3 buckets, so we start by fetching the data (or connecting directly to the buckets) and begin a series of data cleansing steps – this includes removing variables that are not meaningful, removing outliers caused by data entry errors, and standardizing the numeric variables so they are on a consistent scale.

# standardize the numeric variables (z-score scaling)
sample1_num_scale <- data.frame(scale(sample1_num))
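
As an illustration of the outlier removal mentioned above, one simple option (a hypothetical z-score rule, not necessarily the exact rule applied to the client data) is to drop rows with implausibly extreme scaled values:

# hypothetical rule: drop rows with any scaled value beyond 5 standard deviations
extreme_rows <- apply(abs(sample1_num_scale) > 5, 1, any)
sample1_num_scale <- sample1_num_scale[!extreme_rows, ]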

We then convert the character variables to the factor class and lump together the sparse levels to avoid having a large number of factor levels (an alternative is to use the fct_lump() function from the forcats package, part of the tidyverse, for this transformation):

# define a "not in" operator and lump rare ad formats into an "OTHER" level
`%notin%` <- Negate(`%in%`)
sample1[which(sample1[, "source_ad_format"] %notin% c("universal", "html", "vast")), "source_ad_format"] <- "OTHER"
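For reference, a sketch of the forcats alternative mentioned above (the number of levels kept here is illustrative):

library(forcats)
# keep the 3 most frequent ad formats and lump everything else into "OTHER"
sample1$source_ad_format <- fct_lump(as.factor(sample1$source_ad_format), n = 3, other_level = "OTHER")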

We proceed to make dummy variables for our factors and remove variables that are too homogeneous using the nearZeroVar() function from the caret package, as the final steps in our data processing and preparation.
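
A minimal sketch of what these two steps might look like with caret (the object names are illustrative):

library(caret)

# one-hot encode the factor variables into dummy columns
dummies   <- dummyVars(~ ., data = sample1)
sample1_d <- data.frame(predict(dummies, newdata = sample1))

# drop near-constant columns that carry little information
nzv <- nearZeroVar(sample1_d)
if (length(nzv) > 0) sample1_d <- sample1_d[, -nzv]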

Performing PCA in R

The prcomp() function in base R is used to produce our principal components from the data. An alternative to prcomp() is the princomp() function; the latter uses a spectral decomposition approach while the former uses singular value decomposition (SVD) – SVD is generally more commonly used and, according to the R documentation, also has better numerical accuracy.

# PCA on the processed, fully numeric sample
pca <- prcomp(sample1)
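
Before adding visuals, the variance explained and the variable loadings can be inspected directly from the prcomp object using base R alone:

summary(pca)              # proportion of variance explained per component
head(pca$rotation[, 1:3]) # loadings of the original variables on the first 3 PCs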

To help in the exploratory process, we can add visual elements to our report. An example is the scree plot, which visualizes the percentage of variance explained by each principal component (here via fviz_eig() from the factoextra package):

fviz_eig(pca, addlabels = TRUE, ylim = c(0, 30))

[Figure: scree plot of the percentage of variance explained by each principal component]

Another example is the contribution plot, which visualizes how much each original variable contributes to a given principal component:

fviz_contrib(pca, choice = "var", axes = 1, top = 10)

[Figure: contribution of the top 10 variables to the first principal component]

We can also visualize how our data are projected onto the first two principal components (a “biplot”) using the popular ggplot2 visualization library:

df_pca <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], PC3 = pca$x[, 3], gender = gender)

ggplot(df_pca, aes(x = PC1, y = PC2)) +
  geom_point(aes(col = gender)) +
  labs(x = "PC 1", y = "PC 2")

This yields the following plot:

[Figure: scatter plot of the data on the first two principal components, colored by gender]

We can also produce a 3D visualization courtesy of plotly, which allows us to view the data in the reduced three-dimensional space:

plot_ly(x = df_pca$PC1, y = df_pca$PC2, z = df_pca$PC3,
        type = "scatter3d", mode = "markers", color = df_pca$gender)

Since we built our models on a sample of the data, additional statistical tests have to be conducted to ensure there is no sampling error.
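
As one possible check (illustrative only, and not necessarily the exact tests our team ran), the distribution of a variable can be compared between two independently drawn samples; the column name impressions and the object sample2 below are hypothetical:

# illustrative check: are two independently drawn samples distributed alike?
ks.test(sample1$impressions, sample2$impressions)   # numeric variable
both <- rbind(data.frame(src = "s1", fmt = sample1$source_ad_format),
              data.frame(src = "s2", fmt = sample2$source_ad_format))
chisq.test(table(both$src, both$fmt))               # categorical variable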

Alternatives to PCA

Since the data we work with consist of quantitative and qualitative variables, using PCA requires us to create dummy variables in order to obtain a fully numeric matrix. We therefore also tried two approaches that are better suited to mixed-type data: FAMD and PCAmix.

Factor Analysis of Mixed Data (FAMD)

The steps involved in FAMD are largely similar to the ones in PCA; the main difference is that we do not create dummy variables for the qualitative data during the data preparation phase. We then call FAMD() (from the FactoMineR package):

# FAMD handles the mix of numeric and factor columns directly
famd <- FAMD(data, graph = FALSE)
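
The factoextra helpers shown earlier also accept the FAMD result, so the same exploratory plots can be reproduced with minimal changes (a sketch, assuming the famd object above):

fviz_screeplot(famd, addlabels = TRUE)           # scree plot of the FAMD dimensions
fviz_contrib(famd, choice = "var", axes = 1)     # variable contributions to dimension 1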

We have at our disposal the same visual tools as above: the scree plot, contribution plot, biplot, and a 3D plot for visualizing three principal components. We can additionally obtain a plot that illustrates the decision boundary between the gender classes, using the helper function below:

decisionplot <- function(model, data, class = NULL, predict_type = "class",
  resolution = 100, showgrid = TRUE, ...) {

  if(!is.null(class)) cl <- data[,class] else cl <- 1
  data <- data[,1:2]
  k <- length(unique(cl))

  plot(data, col = as.integer(cl)+1L, pch = 16, ...)

  # make grid
  r <- sapply(data, range, na.rm = TRUE)
  xs <- seq(r[1,1], r[2,1], length.out = resolution)
  ys <- seq(r[1,2], r[2,2], length.out = resolution)
  g <- cbind(rep(xs, each = resolution), rep(ys, times = resolution))
  colnames(g) <- colnames(r)
  g <- as.data.frame(g)

  ### guess how to get class labels from predict
  ### (unfortunately not very consistent between models)
  p <- predict(model, g, type = predict_type)
  if(is.list(p)) p <- p$class
  p <- as.factor(p)

  if(showgrid) points(g, col = as.integer(p)+1L, pch = ".")

  z <- matrix(as.integer(p), nrow = resolution, byrow = TRUE)
  contour(xs, ys, z, add = TRUE, drawlabels = FALSE,
    lwd = 2, levels = (1:(k-1))+.5)

  invisible(z)
}

# first two FAMD dimensions plus the gender label
df_famd11 <- data.frame(famd$ind$coord[, 1], famd$ind$coord[, 2], data$gender)

# knn3() from the caret package fits a k-nearest-neighbour classifier on the two dimensions
model <- knn3(data.gender ~ ., data = df_famd11, k = 25)
decisionplot(model, df_famd11, class = "data.gender", main = "Biplot of Gender with Decision Boundary", xlab = "Dim 1", ylab = "Dim 2")

[Figure: biplot of gender on the first two FAMD dimensions with the k-NN decision boundary]

PCAmix

Just like FAMD, PCAmix is another option that works well for datasets with a mixture of qualitative and quantitative variables. It includes ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA) as special cases.

The specification for this function differs slightly from the other two methods above. Here, we separate the qualitative (factor) and quantitative (numeric) data in the function call (refer to the PCAmixdata package for a full description of this function):

pcamix <- PCAmix(data_num, data_fact, ndim = 3, rename.level = TRUE, graph = FALSE)
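
For reference, the data_num and data_fact objects can be obtained with splitmix() from the same PCAmixdata package, which splits a data frame into its quantitative and qualitative parts (a sketch, assuming the processed data frame from earlier):

split     <- splitmix(data)
data_num  <- split$X.quanti   # numeric columns
data_fact <- split$X.quali    # factor columns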

Since we’re using a third party library, we may need to write custom code to generate the scree plot manually:

# proportion of variance explained by the first five PCAmix dimensions
Proportion <- pcamix$eig[1:5, 2]

barplot(Proportion, main = "Scree Plot",
        xlab = "Dimensions", ylab = "Percentage of explained variance", ylim = c(0, 10))

We also write custom code to generate the contribution plot:

# combine the percentage contributions of the numeric and factor variables
a <- rbind(pcamix$quanti$contrib.pct, pcamix$quali$contrib.pct)

# take the five variables that contribute most to dimension 1
b <- sort(a[1:50, 1])
contributions <- b[46:50]

cc <- data.frame(contributions)
cc$variables <- row.names(cc)

ggplot(cc, aes(x = reorder(variables, -contributions), y = contributions)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 20)) +
  ggtitle('Contribution of Variables to Dim-1') +
  theme(plot.title = element_text(size = 14, face = "bold"))

Conclusion

Whenever we are assigned work from clients, the data scientists and analysts ought to first translate business goals into concrete research objectives. In this project, we started off with the high-level idea of finding possible structures, or patterns, in our data that could help predict the gender class (male or female) of device owners given a handful of handset-level data. Understanding which among the hundreds of variables could be good predictors helps in model construction when we move on to develop a classification model.

PCA is best used on quantitative data; it performs an orthogonal linear transformation of the data. FAMD and PCAmix, on the other hand, are better suited to data that mix qualitative and quantitative variables (mixed data).

Taking a closer look at the visualizations, we also learn that a fairly clear decision boundary is formed even when the dimensional space has been greatly reduced (from the original 140 dimensions down to 2 or 3 dimensions).
