Topic Extraction From Text: Approaches And Learning

My Learning Journey Towards App Review Topic Extraction

Andrew William, Data Scientist at Supertype (Unit 1)

Why topic analysis, and topic extraction in particular, isn't always straightforward, and how we approach it.

Background: Topic Extraction on App Reviews

When you create an app and release it on marketplaces such as the Google Play Store or iOS App Store, you are putting your app and your services out for billions of people to download and consume. As a result, it is no wonder that good apps can quickly garner hundreds of thousands of users, spawning an entire industry (“mobile growth”) and profession (“user acquisition”).

Subsequently, apps become a gold mine for data – be it about user preferences, locations, and more. All this data represents a rich set of features that, when analyzed through the lens of data science, can yield all sorts of valuable insights for businesses. In this article, I will provide a slightly technical glimpse into my internship journey, as I led one such app review analysis project for Supertype. The discussion will revolve around app reviews, specifically from the Google Play Store, because this is what my team and I focused on. It is worth noting that this article does not focus on in-depth technical solutions; instead, it highlights the research and analysis thought process and journey.

At face value, combining app reviews with other app information provides substantial insight into how customers perceive and use your mobile app. For example, analyzing the number of 1-2 star ratings alongside the dates reviews were posted could give us an idea of how satisfied customers are with your app over time. When combined with other Natural Language Processing (NLP) techniques, app publishers can unlock immense value from this wealth of data.

Problem Formulation

Back to our previous example, let’s say we know that the app did particularly badly in 2020. Now comes the question: ‘what aspects of the app and service did customers dislike most?’ If we’re being more ambitious, this extends to incredibly specific follow-up questions like, “between shortening the current delivery time and adding one popularly requested payment channel, which leads to the bigger gain in customer satisfaction (read: business advantage)?”

The benefits are numerous for app and mobile game publishers. For example, knowing which problem to prioritize from the user’s perspective, be it the occasional app crash, the unnecessarily long tutorial screen, or a smoother checkout flow, means a more targeted effort at maximizing customer happiness at every step of the app’s release.

For my unit, which was tasked with NLP research, this translates to a topic extraction problem on the app reviews data. Topic extraction also happens to be one of the initial steps in Supertype’s app analysis pipeline, and our unit of 5 data scientists dedicated a lot of time to it.

The Dataset

At the time of this writing, our unit has moved on to topic extraction from review text in the e-commerce industry, so we scraped the web and obtained around 70k reviews from various e-commerce players. From those reviews, we wanted to discover ‘negative’ topics related to customer dissatisfaction, so we decided to keep only the 1-3 star reviews, ending up with a dataset of 10k reviews.

We make the liberal assumption that 1-3 star reviews can be treated as ‘negative’ reviews. In reality, a different approach, such as a trained sentiment classification model, would serve this purpose better.
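To illustrate the filtering step, here is a minimal sketch using pandas on a hypothetical toy DataFrame (the texts, ratings, and column names below are invented for illustration; the real dataset held ~70k scraped reviews):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the scraped review dump.
reviews = pd.DataFrame({
    "text": ["great app", "keeps crashing", "slow delivery", "love it", "checkout broken"],
    "rating": [5, 1, 2, 4, 3],
})

# Keep only 1-3 star reviews as a (liberal) proxy for negative sentiment.
negative = reviews[reviews["rating"] <= 3].reset_index(drop=True)
print(len(negative))  # 3 of the 5 toy reviews survive the filter
```

In practice, this rating cutoff is exactly the assumption flagged above; a sentiment classifier would replace the `rating <= 3` condition with a model prediction.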


Supervised or Unsupervised Algorithms

Before beginning the research cycle, we had to decide between a supervised and an unsupervised approach. With a supervised approach (text reviews paired with topic labels), we could frame topic extraction as a text classification problem. We could then train machine learning classifiers such as SVMs, Random Forests, or Gradient Boosted Trees, or deep-learning-based classifiers such as LSTMs or Transformers, for sequence (text) classification. This would have been the ideal approach, except for one major caveat – our 5-person unit would need to label all 10,000 reviews manually, since every one of them was unlabeled. Sadly, this sort of resource constraint is an all-too-familiar problem when modeling real-world data.
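Had we had labels, the supervised framing could look something like the sketch below: TF-IDF features feeding a linear SVM, a standard text-classification baseline. The toy texts and topic labels are entirely hypothetical (having them is precisely what we lacked):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled examples (hypothetical labels -- exactly what our dataset lacked).
texts = [
    "app crashes on startup", "crashes whenever I open it",
    "delivery took two weeks", "my order arrived very late",
    "cannot pay with my bank", "please add more payment channels",
]
labels = ["crash", "crash", "delivery", "delivery", "payment", "payment"]

# TF-IDF features + a linear SVM: a common text-classification baseline.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["the app crashes constantly"])[0])  # prints "crash"
```

An SVM is shown here purely as one representative choice from the classifier families mentioned above.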

This nudged our team in the direction of an unsupervised approach.

Unsupervised Attempt

For our initial unsupervised approach, we framed the task as a topic modeling problem. Hence, we started off experimenting with multiple variations of classic models such as:

  1. Latent Dirichlet Allocation (joint posterior probabilities estimation, using topic-document, topic-word distributions, Dirichlet distributions) with Gibbs Sampling
  2. Non-Negative Matrix Factorization (also a dimensionality reduction technique, using, well, matrix factorization)
  3. Guided LDA (basically LDA, but seeding some topics with keyword priors)
  4. Latent Semantic Analysis (similarity-based matrix factorization, leveraging Singular Value Decomposition).

Problems that surfaced with the initial attempts

We carried out numerous experiments with small variations in model parameters and types, but found that none of the results were satisfactory.

After digging deeper, I found numerous research articles that shed light on our failures. They pointed out that traditional topic models simply under-perform on short text reviews (in our case, reviews averaged 6-7 words after preprocessing), because short reviews contain few keywords. This in turn means little word co-occurrence information, which presents an unexpectedly great challenge even for modern researchers (really noisy data).

The second main problem was that, since all our data was unlabeled, we did not have a proper metric or way to benchmark our models’ performance. For the earlier models, we deemed the results unsatisfactory by looking at the word makeup of the topic clusters formed and eyeballing sample topic distributions generated from the dataset.

The third problem was that a completely unsupervised approach would not conveniently yield consistent topics, which was what we needed for the overall pipeline.

Overcoming challenges

At this point, we were frustrated: the research journey for our topic extraction process grew rockier, and our efforts felt futile. Fortunately, we were aided by an extraordinary mentor, who constantly guided us on our next possible actions. Thanks to him, we eventually solved problem 2 (the benchmark problem) through our own variation of the Mean Average Precision@K (MAP@K) metric. This might sound odd, because MAP@K is commonly a recommender system metric, but we had to get creative with our limited options, and so far it has worked like a charm. We also took it upon ourselves to hand-label a small portion of our dataset to serve as the benchmark.
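For readers unfamiliar with the metric, here is a minimal sketch of Average Precision@K in the standard recommender-system formulation, applied to ranked topic predictions against a hand-labeled set. The topic names are hypothetical, and this is the textbook definition rather than our exact in-house variation:

```python
def average_precision_at_k(predicted, relevant, k):
    """Average of precision@i over the positions i <= k where a relevant item appears."""
    hits, score = 0, 0.0
    for i, p in enumerate(predicted[:k], start=1):
        if p in relevant:
            hits += 1
            score += hits / i  # precision at this cut-off
    return score / min(len(relevant), k) if relevant else 0.0

# Hypothetical example: the model ranks topics for one review;
# the hand-labeled ground truth is a set of true topics.
predicted_topics = ["delivery", "crash", "payment"]
true_topics = {"delivery", "payment"}
print(round(average_precision_at_k(predicted_topics, true_topics, k=3), 3))  # 0.833
```

MAP@K is then simply this value averaged over all hand-labeled reviews, which is what lets a small labeled sample benchmark the whole pipeline.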

To address the other two problems, I envisioned two ways we could proceed. The first was to try tweaks and techniques that other NLP researchers have suggested, such as document pooling, semantic topic modeling, and so on. The second was to do a complete re-base and re-frame the problem as a semi-supervised one.

After much discussion among ourselves and with our mentor, we went with the second path, because it would give us control over the topics formed, fixing problem 3 (topic inconsistency), and it offered possible workarounds for problem 1 (under-performing classic topic models).

Semi-supervised baseline

With a semi-supervised mindset, I got creative and eventually came up with an LDA + Word2Vec method, building on my mentor’s suggestion to complement the LDA models with a model that captures semantic meaning.

This model relied on finding controlled topic clusters using LDA, then manually mapping the clusters back to a fixed set of topics the unit expected to see (this is the supervised part). To expand each topic’s vocabulary, I took the top n words for each topic cluster and ran similarity searches using Gensim’s Word2Vec methods. Finally, I implemented a compound scoring method that combines the probability distribution values obtained from LDA with the cosine similarity values from the Word2Vec similarity searches.

This model became our unit’s base model, since it served the purpose of consolidating a general semi-supervised approach to our difficult situation. In light of that, I was content with my work.

What’s Next?

Unfortunately, this marked the end of my internship with Supertype. My colleagues branched off from my work, with some expanding further on a pure Word2Vec approach using document-level similarities with topics, and others looking toward attention-based (transformer-based) semi-supervised models, like those used in few-shot learning.

Regardless of their methods, I am confident in my amazingly talented colleagues’ abilities to finish this problem, so make sure to follow their progress and future updates on Supertype.

Final Thoughts and Conclusion

Through this article, I have shared my personal learning journey as I methodically approached a real-life and fairly common data science problem. At the same time, I exposed some of my own mistakes in my approach.

To be fair, things might have gone a lot more smoothly had I made several considerations earlier on: namely, the benchmark problem associated with unsupervised methods, as well as the issue of properly framing the NLP task from the problem statement. There were also multitudes of other lessons I gratefully learned from my mentor and my colleagues, but those would stretch this article out for too long, so I shall save them for another time.

Thank you for taking the time to read this, and I hope you learned something from this read. 

If you’re interested in chatting more about this problem or anything about data science in general, feel free to connect with me over LinkedIn where I also plan to post more educational/journal-style articles in the future.


P.S.: I’d like to give special thanks to my mentor, Steven Christian, for helping me proof-read this article, and for graciously sharing his knowledge with me and my unit over my internship. Feel free to connect with him over LinkedIn too; he’s extremely knowledgeable in data science (with a focus on Computer Vision), really friendly, and an excellent teacher overall.
