Data Science for App Review Analysis
Mining App Reviews: The core skills to make the most out of text reviews
Miftah Ahmad Choiri, Unit 4
A bird’s eye view on working with app review analytics using various Natural Language Processing (NLP) and data science methodologies.
Background: Learning from Unhappy Customers
“Your most unhappy customers are your greatest source of learning.”
— Bill Gates, Co-founder of Microsoft
In life and in business, constantly seeking feedback and making corrections is the only way to survive. Roughly 1% of customers will leave a review for a product or service they’ve used, and oftentimes many of these reviews will be negative.
To a business, these represent a constant source of challenges (a recent survey showed that 57% of consumers will avoid a business that has negative reviews online) as well as opportunities. After all, what richer, truer, more accessible source of information is there than the customers themselves?
I wrote this article to describe the thought process and the wide array of data science tools at our disposal when it comes to mining these customer reviews, and to offer my own take on what I felt were the most important skills when it comes to working with text and natural language processing (NLP) in general. It’s a record of my learning journey since joining Supertype three months ago, and will focus on the “what” instead of the “how”.
1. Problem Solving & Critical Thinking
It’s important to obtain a high-level idea of the main problem at hand before diving into the project, a procedure I like to call “problem framing”. My framework of choice is DSPA (Define problem – Structure problem – Prioritise issues – Action plan), a technique commonly deployed by management consultants.
A brief explanation for the framework above:
- Start with Defining the Problem: you need to know the specific problem that needs solving and the ideal outcome (target); the target should be SMART (Specific, Measurable, Actionable, Realistic and Time-bound). In our case, the problem is defined as deriving actionable insights for business owners from their app review data, using natural language processing techniques and analytical approaches, in under a month.
- Once the problem statement is defined, the second step is to Structure the Problem. You can use a MECE (mutually exclusive, collectively exhaustive) issue tree to generate as many candidate solutions as possible for tackling the root cause of the defined problem.
- With the issues structured, we can build a Prioritization Matrix to classify them into four quadrants. One can follow the popular Pareto technique (80/20 rule) to identify the areas most worth investing in: the sub-tasks that, when solved, yield the biggest progress towards the business objective.
- Once we have the priority list for all sub-tasks, we can move on to the Action Plan and single-mindedly focus on execution.
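To make the prioritisation step concrete, here is a minimal sketch of classifying scored sub-tasks into the four quadrants of a priority matrix. The task names, scores, quadrant labels, and threshold are all hypothetical, purely for illustration:

```python
# Minimal prioritization-matrix sketch: each sub-task gets an
# importance and urgency score (0-10) and falls into one of
# four quadrants. Names and scores below are made up.

def quadrant(importance, urgency, threshold=5):
    """Classify a sub-task into a priority-matrix quadrant."""
    if importance >= threshold and urgency >= threshold:
        return "do first"       # high importance, high urgency
    if importance >= threshold:
        return "schedule"       # high importance, low urgency
    if urgency >= threshold:
        return "delegate"       # low importance, high urgency
    return "deprioritise"       # low importance, low urgency

# Hypothetical sub-tasks scored as (importance, urgency)
tasks = {
    "fix app crashes":       (9, 9),
    "improve onboarding":    (8, 3),
    "reply to every review": (3, 7),
    "redesign app icon":     (2, 2),
}

matrix = {name: quadrant(i, u) for name, (i, u) in tasks.items()}
```

The same quadrant logic reappears later in the analysis phase, where topics mined from reviews are ranked by importance and urgency.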
2. Data-Driven Analysis Framework
In order to infer accurate trends and patterns, your research needs to be meticulous and organised. A data analyst should have the ability to gather the necessary data and design a series of experiments aligned with the research objective.
A brief summary of each of the four key phases in our data analysis workflow:
- Data Preprocessing: In all likelihood, you will need to wrangle the data and prepare it before modeling. This ranges from basic data type conversions to more comprehensive cleansing routines (removing duplicates, correcting erroneous data, dealing with missing values, normalization, imputation, etc.). In the context of natural language processing, examples of preprocessing include: erasing or substituting emoji (😊 to “smiley”), expanding contractions, removing punctuation, removing stopwords, lemmatization, removing short words, extracting n-grams, and dealing with any occurrences of foreign-language text reviews (translation, omission, etc.).
- Data Processing: You will now have to select a more specific processing routine depending on the algorithm or NLP objective. For example, in the case of a textual review classification problem, a great approach is to process the text so it is suited for the topic modeling or text classification task. Suppose we’re handling sentiment analysis and sorting the reviews by a normalised rank; we could later use feature-based opinion mining algorithms (High Adjective Count), which help us discover the topics that have the biggest impact on our ratings, positively or negatively.
- Data Analysis: Data analysis refers to the process of evaluating our data using analytical and statistical tools. The goal of data analysis is to discover useful information and draw inferences that can help us make more informed decisions. In our case, we discover actionable insights by formulating a new action priority matrix, ranking each topic by its importance score (how important is this “topic”?) and urgency score (how urgent is this “topic”?) (read more about topic modeling / topic extraction on app reviews here).
- Data Visualization: Data visualization is an effective way to gain, and communicate, insights when done correctly. When we’re juggling the various considerations (chart design, dynamic combinations, two-dimensional charts, three-dimensional charts, linkage, drilling, etc.), it’s easy to lose sight of the bigger picture. To me, visualization is ultimately about communication: picking the right presentation to help the reader identify problems and spot opportunities should be the goal. The value of data analysis isn’t in its being read; it is in the actions that it inspires.
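A minimal sketch of the preprocessing steps described above, in pure Python. The emoji map, contraction map, and stopword list here are tiny illustrative samples, not production resources:

```python
import re

# Toy lookup tables for illustration only
EMOJI = {"😊": " smiley "}
CONTRACTIONS = {"don't": "do not", "can't": "can not", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "is", "do", "not", "to", "it", "and"}

def preprocess(review, min_len=3):
    """Lowercase, substitute emoji, expand contractions, strip
    punctuation, remove stopwords, and drop short words."""
    text = review.lower()
    for emo, word in EMOJI.items():         # substitute emoji
        text = text.replace(emo, word)
    for con, full in CONTRACTIONS.items():  # expand contractions
        text = text.replace(con, full)
    text = re.sub(r"[^\w\s]", " ", text)    # remove punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [t for t in tokens if len(t) >= min_len]

print(preprocess("Don't love the app, it's slow! 😊"))
# → ['love', 'app', 'slow', 'smiley']
```

In a real pipeline you would swap the toy tables for proper resources (e.g. NLTK's stopword lists and a lemmatizer), but the shape of the routine stays the same.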
3. Business Acumen
Another area of expertise for data analysts is business acumen, a skill that helps them ask the right questions about the problem at hand, which in turn leads to arriving at solutions faster.
Data analysts and data scientists with great business acumen find ways to help stakeholders build intuition about the problem domain and the proposed solutions, and help link those solutions to the underlying system of operations.
Any business organisation with a sufficiently complex system will immediately benefit from a key team of data analysts with acute domain knowledge: they will navigate the sea of different contexts and help break down big monolithic problems into smaller, more manageable ones.
4. AI and ML (Machine Learning) in Natural Language Processing
For any data scientist diving into text analytics or natural language processing (NLP), a solid grasp of core machine learning principles is a really good foundation.
In essence, the role of machine learning (ML) and artificial intelligence (AI) in natural language processing is to improve, accelerate and automate the underlying text analytics functions that turn unstructured text into usable data and insights. Some examples of interesting approaches in this area are:
1. VADER Sentiment (Valence Aware Dictionary and sEntiment Reasoner): a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER combines a sentiment lexicon (a list of lexical features, e.g. words, generally labelled according to their semantic orientation as positive or negative) with a set of grammatical rules.
2. Tokenization: an interesting part of text analytics. A “token” in natural language terms is “an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.” Like the roots and branches of a tree, the whole of human language is a mess of natural outgrowths: split, decaying, vibrant, and blooming. Tokenization is part of the methodology we use when teaching machines about words, the foundational part of our most important invention.
3. Part-of-Speech Tagging (PoS tagging): identifying each token’s part of speech (noun, adverb, adjective, etc.) and tagging it as such. PoS tagging forms the basis of a number of important NLP tasks: we need to correctly identify parts of speech in order to recognize entities, extract themes, and process sentiment. Lexalytics, for instance, has a highly robust model that can PoS tag with >90% accuracy, even for short, gnarly social media posts.
4. Categorization and Text Classification: categorization in text mining means sorting documents into groups. Automatic document classification uses a combination of natural language processing (NLP) and machine learning to categorize customer reviews, support tickets, or any other type of text document based on their contents.
5. Topic Modeling (Unsupervised): a technique that identifies words and phrases that frequently occur with each other. Data scientists use techniques such as Latent Semantic Indexing (LSI) for faceted searches, or for returning search results that aren’t the exact search term.
6. Word Embedding: a learned representation for text where words that have the same meaning have a similar representation. This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.
7. Pre-trained Word Vectors: pre-trained models are the simplest way to start working with word embeddings. A pre-trained model is a set of word embeddings created elsewhere that you simply load onto your computer and into memory.
8. Language Detection: the problem of determining which natural language a given piece of content is in. Computational approaches view this as a special case of text categorization, solved with various statistical methods.
9. Sentence Translation: you can do a lot with the Google Translate API from within Python, ranging from detecting languages to simple text translation, setting source and destination languages, and translating entire lists of text phrases.
10. High Adjective Count: a group of functions for handling sentiment analysis using feature-based opinion mining, which optimizes the scores of nouns to extract potential key phrases from the text based on the nouns and adjectives within each sentence.
11. TF-IDF: a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally with the number of times a word appears in the document, and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.
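As a concrete illustration of the last item, here is a from-scratch TF-IDF sketch in pure Python. The tiny corpus is made up; in practice you would reach for a library implementation such as scikit-learn's TfidfVectorizer:

```python
import math
from collections import Counter

# From-scratch TF-IDF: score(t, d) = tf(t, d) * idf(t), where
# idf(t) = log(N / df(t)) and df(t) is the number of documents
# containing term t. The three "reviews" below are illustrative.
corpus = [
    "app crashes on startup".split(),
    "great app love the design".split(),
    "crashes after the update".split(),
]

def tf_idf(term, doc, docs):
    tf = Counter(doc)[term] / len(doc)             # term frequency
    df = sum(1 for d in docs if term in d)         # document frequency
    idf = math.log(len(docs) / df) if df else 0.0  # inverse doc frequency
    return tf * idf

# "design" appears in only one document, so within that review it
# carries more weight than the common word "the".
```

Common words that appear across many reviews (like "the" or "app") get discounted, while words distinctive to a single review float to the top, which is exactly the weighting behaviour described above.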
5. Statistical Data Visualization
For a data analyst, the ability to tell a compelling story with data is crucial to getting your point across and winning buy-in. If your findings can’t be easily and quickly identified, then you’re going to have a difficult time getting through to others. For this reason, data visualization can have a make-or-break effect on the impact of your analysis and research work.
Analysts use eye-catching, interactive charts and graphs to present their findings in a clear and informative manner.
Since your findings are ultimately presented to others, investing in this area of your data science competencies will significantly improve the quality of your delivery. If you’re primarily a Python programmer (data scientist), data visualization packages such as seaborn, altair, and plotly are all great places to start, and you should be proficient in at least one of them.
In my case, I often use seaborn, altair, or plotly at work; they are designed for simplicity, with high-level APIs. They’re convenient and customisable, abstracting away a lot of the lower-level details so the analyst can focus on bringing their plots into a larger dashboard or delivery medium.
Up to this point, you’ve brought in different tools and methodologies from across the data science spectrum to make sense of your data (in our case, customer app reviews on the Apple and Google Play app stores). We’ve covered the importance of understanding the problem domain, and a problem-solving framework that emphasises business acumen, foundational machine learning principles, and a good amount of visualization skill.
What is equally important at this point is your statistical knowledge. Being a key pillar of machine learning, it’s likely you have some level of familiarity or mastery in statistics already, but statistics also adds a layer of interpretability to your analysis and research work.
In contrast to machine learning, where predictive ability and algorithmic performance take the central role, statistics adds a measure of interpretability to your work and helps you stay on course in doing good, qualified scientific work. It helps you answer questions like “how much of what I observe is down to random chance, or sampling error, when it comes to users’ app preference for in-app notifications?” or “to what extent do customers prefer a one-time purchase option over a subscription model, and is that a repeatable observation across different segments?”
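For instance, a quick significance check of the second kind of question can be sketched with a two-proportion z-test in pure Python. The review counts below are entirely hypothetical; in practice you might reach for a library routine instead:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test: is the gap between two observed
    proportions bigger than sampling error would explain?"""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)  # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se                         # z statistic

# Hypothetical counts: in segment A, 120 of 200 reviewers prefer a
# one-time purchase; in segment B, 90 of 200. |z| > 1.96 means the
# gap is significant at the 5% level (two-sided).
z = two_proportion_z(120, 200, 90, 200)
```

Here z comes out around 3.0, comfortably above 1.96, so under these made-up numbers the preference gap between segments would not be plausibly explained by sampling error alone.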
A statistical package you should look at is statsmodels, which, like the visualization libraries mentioned above, provides a high-level abstraction so you don’t end up writing long lines of code for a statistical model.
6. Data Storytelling
I’ve emphasised data visualization as a skill set in the point above, but its importance really extends beyond the data science domain. Learning to communicate your results is beneficial, no matter what line of work or business you’re in.
But presenting doesn’t always come naturally to everyone, and that’s okay! Even seasoned presenters let their nerves get the best of them at times. As with anything else, start with practice and then practice some more until you get into your groove. Then, cultivate a certain mindfulness in how you present information, or rather, how you distill information and break it down into digestible chunks that can inspire action and decision-making.
Until you’ve learned to communicate your findings succinctly and effectively, the work you do will remain in your code notebooks, with limited or no real-life impact.
So as you cultivate your visualization skills, go beyond the development of graphs and other artistic elements, and look at ways to tell a coherent story, to represent findings, and to draw attention to fact-based information (“statistics”) in a compelling manner. In our app reviews example, this involves constructing a short, action-list presentation that puts the business impacts at the centre of our findings.
As the client is in the pharmacy delivery business, it deals with locational constraints (no available pharmacy nearby), stock logistics (ordered drugs not being available), customer service issues (long customer support response times) and other service- or app-related functionality issues (app crashes, freezing screens, etc.). We completed our work with an action priority matrix highlighting the most pressing issues, ranked by severity (biggest impact on customer happiness) and urgency (how urgently the issue needs to be fixed), derived from mining customer reviews with the series of NLP techniques briefly described above.