Automated Keyword Extraction from Job Descriptions

What are the key skills for a data analyst, based on job descriptions on job portals like indeed.com?

Eveline Surbakti, Unit 5

Using Natural Language Processing (NLP) techniques and GloVe embeddings to study the keywords found in online job descriptions for a Data Analyst role

What makes a good Data Analyst candidate?

As a former business analyst, I was used to thinking about our relationship with data and the role it plays in facilitating our decision making. Having completed my Master's degree in data and computational science last year, I am now making the transition into data science and analytics. While reviewing the various data analyst job descriptions posted on LinkedIn, Indeed and Glassdoor, I constantly found myself picking out keywords from the job summary. It then occurred to me that this laborious process could be automated through some form of natural language processing (NLP).

This article is a summary of that experiment.

Acquiring the Dataset

We will be using a dataset from Indeed (indeed.com). I started off by writing a simple scraper using the rvest library for R. To help illustrate this process, consider the following job posting for a Data Analyst position. Take note of the colored annotations 1 to 4 in the image below:

  1. The link to our target page; in this case, it’s a listing of all postings for the query “Data Analyst” found on the Ireland version of Indeed.com
  2. I use a simple Chrome extension to help retrieve the right CSS selector without too much hassle
  3. Select the area of the page
  4. Copy the CSS selector; in our case it is the “summary” class (.summary)
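
The original scraper was written in R with rvest and is not reproduced in this article. As a rough Python stand-in that uses only the standard library, the sketch below collects the text of every element carrying the summary class from an HTML string. The sample markup, the job-card class and the SummaryExtractor name are illustrative assumptions; the real page structure may differ, and a live scraper would first have to download the page.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class SummaryExtractor(HTMLParser):
    """Collect the text inside every element whose class list contains 'summary'."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while we are inside a .summary element
        self.summaries = []   # one cleaned string per matched element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1   # nested tag inside a summary
        elif "summary" in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join the text fragments and collapse runs of whitespace.
            self.summaries.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)

# Hypothetical markup standing in for a search results page.
sample_html = """
<div class="job-card">
  <h2 class="title">Data Analyst</h2>
  <div class="summary">Analyse <b>large</b> data sets using SQL.</div>
</div>
<div class="job-card">
  <div class="summary">3+ years experience in reporting.</div>
</div>
"""

parser = SummaryExtractor()
parser.feed(sample_html)
print(parser.summaries)
```

Each string in parser.summaries corresponds to one job summary and can be stored as one row of the dataset.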

In short, our scraper follows the routine above to collect the information and store it in a DataFrame, filtered specifically for data analyst openings in Dublin, Ireland, where I reside. I have also stored the years of experience required as a numeric value and done the preprocessing that is typically involved in a language processing task:

  • Removing stop words (words that have little value to our research objective)
  • Removing punctuation, numbers and extra whitespace
  • Converting character strings to lowercase
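
The three preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline actually used: the stop-word list is a tiny made-up sample, and the preprocess function name is an assumption.

```python
import re

# A tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "with", "is", "are"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, collapse whitespace, drop stop words."""
    text = text.lower()
    # Anything that is not an ASCII letter or whitespace becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()  # split() also collapses extra whitespace
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("The Data Analyst will work with  3+ years of SQL experience!")
print(tokens)
```

Lowercasing comes first so that the stop-word lookup and the character filter only have to deal with one case.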

Exploratory Data Analysis for Text

We begin with a basic but common NLP technique (“bag of words”) that constructs features from the frequencies of terms (or words; the two are used interchangeably here). Such features are typically used to train a classifier on a collection of text (also known as a corpus).
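
To make the bag-of-words idea concrete, here is a small Python sketch that turns a toy corpus into a document-term count matrix. The two example documents and the bag_of_words function name are assumptions made for illustration; they are not taken from the scraped data.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(doc.split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Toy corpus (made up); each string is one already-cleaned job summary.
corpus = [
    "data analyst sql data",
    "business analyst reporting",
]
vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Word order is discarded: each document is represented only by how often each vocabulary term occurs in it.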

Years of Experience for Data Analyst positions

The first question was: “How many years of experience do employers require for their data analyst positions?” Because I scraped the website twice, once for all the openings in October and again for the following month, the distributions are plotted for each month:

One can observe that most of the job openings for data analyst positions seek candidates with two to three years of experience.
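
The article does not describe how the numeric years-of-experience value was extracted from each posting. One plausible approach, sketched here in Python, is a regular expression that picks up phrases such as “3+ years” or “2-3 years” and keeps the lower bound. The pattern and the min_years_required name are assumptions, not the author's method.

```python
import re

# Matches e.g. "3 years", "3+ years", "2-3 years", "2 to 3 years" (assumed phrasing).
YEARS_PATTERN = re.compile(r"(\d+)\s*(?:\+|-\s*\d+|to\s+\d+)?\s*years?", re.IGNORECASE)

def min_years_required(text):
    """Return the first (lowest-bound) number of years mentioned, or None."""
    match = YEARS_PATTERN.search(text)
    return int(match.group(1)) if match else None

print(min_years_required("Minimum 3+ years experience with SQL"))
print(min_years_required("2-3 years of experience in a similar role"))
print(min_years_required("Strong analytical skills required"))
```

Postings that return None would have to be excluded from, or reported separately in, the experience distribution.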

Job listings to frequency tables

Looking only at the November set, these are the top 10 words by frequency (count):

Word          n 
data        909 
experience  150 
analyst     133 
analysis    118 
will        113 
business    112 
team         85   
analysts     84
work         69 
role         66 

What’s interesting is that the word “experience” turns out to be the second most frequent, which underlines the emphasis that recruiters place on experience when assessing candidates for the data analyst role.

We can also compute a frequency table for the recruiting companies. Based on what we scraped, the top 3 companies are recruitment agencies with a big presence in Ireland. These are the most frequent company and word pairs, most of which involve the word “data”:

Company Name                    Word         n 
Morgan McKinley                 data        36 
Eolas Recruitment               data        25
Accenture                       data        19
Regeneron                       data        19 
Reperio Human Capital           data        18 
Eurofins Central Laboratory     data        17
Segment                         data        15 
TikTok                          quality     15 
Red Tree Recruitment            data        14 
Red Tree Recruitment            recruit     13 
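
A per-company table like the one above amounts to counting (company, word) pairs. The Python sketch below shows the idea on made-up postings; the company names, tokens and the company_word_counts function are all hypothetical.

```python
from collections import Counter

def company_word_counts(postings):
    """Count (company, word) pairs over an iterable of (company, token_list) postings."""
    counts = Counter()
    for company, tokens in postings:
        counts.update((company, tok) for tok in tokens)
    return counts

# Made-up postings: (company name, cleaned tokens of one job summary).
postings = [
    ("Acme Recruitment", ["data", "analyst", "data"]),
    ("Acme Recruitment", ["data", "sql"]),
    ("Example Labs", ["quality", "data", "data"]),
]
top_pairs = company_word_counts(postings).most_common(2)
for (company, word), n in top_pairs:
    print(company, word, n)
```

Because the key is the pair rather than the company alone, one company can appear several times in the ranking with different words.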

Words Visualization

We can also take a high-level snapshot of the word distribution. Unsurprisingly, the words “data” and “analyst”, as well as common functions relating to the job (“manage”, “experience”, “develop”, “process”, “compliance” etc.), are the most common words in the job descriptions for this role:

Unigrams and Bigrams

From the wordcloud visualization above, one can see how single words sometimes fail to capture the full meaning of a phrase. “Data Management” or “Version Control” means something very different as a phrase than its words do when split up and considered as single terms (we call them “unigrams”). A sequence of n consecutive terms is called an “n-gram”, so a pair of two consecutive terms that make up a key phrase is called a 2-gram, or bigram.
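
As a small illustration of the definition, the Python sketch below generates unigrams and bigrams from a token list and joins multi-word terms with an underscore, a convention common in tokenizer output. The ngrams function name and the example tokens are assumptions.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["strong", "data", "management", "skills"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

Note that bigrams are built after stop-word removal in this sketch, so two words that were separated by a stop word in the original text can still end up paired.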

This is a sample of the unigrams and bigrams from our dataset (the full result is omitted for brevity):

[1] "analysing"             "drive"                 "sets" 
[4] "trends"                "analyst will"          "data_analysts" 
[7] "ensuring"              "experienced"           "maintain" 
[10] "manager"              "modelling"             "operations" 
[13] "privacy"              "projects"              "protection" 
[16] "related"              "solutions"             "sources" 
[19] "understand"           "content"               "data_quality" 
[22] "development"          "dublin"                "product" 
[25] "required"             "business_analyst"      "internal" 
[28] "manage"               "use"                   "data_integrity" 
[31] "experience_data"      "leading"               "process" 
[34] "teams"                "test"                  "using" 
[37] "based"                "data_analyst"          "system" 
[40] "clients"              "customer"              "design" 
[43] "insights"             "market"                "processes" 
[46] "responsible"          "security"              "conduct" 
[49] "join"                 "company"               "global" 
[52] "new"                  "requirements"          "data_analytics" 
[55] "provide"              "reports"               "within" 
[58] "information"          "looking"               "technical" 
[61] "compliance"           "integrity"             "key" 
[64] "large"                "risk"                  "tools" 
[67] "years_experience"     "complex"               "identify" 
[70] "analytics"            "including"             "sql" 
[73] "knowledge"            "review"                "systems" 
[76] "skills"               "understanding"         "financial" 
[79] "ensure"               "analytical"            "reporting" 
[82] "working"              "ability"               "years" 
[85] "data_analysis"        "management"            "strong" 
[88] "support" 
... [100] "data management"

Similarity Analysis with GloVe Word Vectors

We can also find the words that are most similar to the terms “Business Analyst” or “Data Analyst” to help us get a sense of the topics or skills relating to our job search. To compute similarity, I used cosine distance and then visualized the top 10 words closest in vector space to “business analyst” or “data analyst”:

It seems that both of these jobs have a fairly high requirement for years of experience (“years” and “experience” come in at 6th and 5th place respectively), but they diverge in other areas: positions for business analysts use words such as “client”, “financial” and “systems”, while the closest terms for data analyst positions are “data”, “modelling”, “sql” and “information”.
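
The ranking step described above can be sketched in Python as follows. The three-dimensional vectors are invented toy values, not real GloVe embeddings, and the function names are assumptions; the point is only to show how a nearest-terms list is derived from cosine similarity (the ordering is the same as for ascending cosine distance).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(target, vectors, top_n=10):
    """Rank all other terms by cosine similarity to the target term."""
    scores = {
        word: cosine_similarity(vectors[target], vec)
        for word, vec in vectors.items()
        if word != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy 3-dimensional vectors (made up); real GloVe vectors here have 100 dimensions.
toy_vectors = {
    "data_analyst": [0.9, 0.1, 0.3],
    "sql": [0.8, 0.2, 0.4],
    "modelling": [0.7, 0.0, 0.5],
    "client": [0.1, 0.9, 0.2],
}
nearest = most_similar("data_analyst", toy_vectors, top_n=2)
print(nearest)
```

Cosine similarity ignores vector length, so frequent and rare terms are compared only by the direction of their embeddings.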

Plotting GloVe Word Vectors using Multidimensional Scaling

A vector with 100 dimensions is not very helpful in terms of interpretation, so we will use multidimensional scaling (MDS) to produce a low-dimensional representation of the data while aiming to preserve the distances between points in the new representation. MDS essentially produces a ‘map’ of the observations onto new points in a lower-dimensional space.

Once we’ve applied MDS with Euclidean distances between these word vectors, we can visualize them and use this new representation to uncover groupings of words into broader subject areas.
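
The article does not say which MDS variant was used. As one possibility, the Python sketch below implements classical (Torgerson) MDS with Euclidean distances in pure standard-library code: it double-centers the squared distance matrix and keeps the leading eigenvectors, found here by simple power iteration. The function names and the toy input are assumptions, and a production run would rely on a numerical library instead.

```python
import math

def top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    n = len(matrix)
    start_norm = math.sqrt(sum((i + 1) ** 2 for i in range(n)))
    vec = [(i + 1) / start_norm for i in range(n)]  # arbitrary unit start vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # matrix is numerically zero
            break
        vec = [x / norm for x in nxt]
    mv = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    value = sum(v * m for v, m in zip(vec, mv))  # Rayleigh quotient
    return value, vec

def classical_mds(points, dims=2):
    """Embed points in `dims` dimensions while preserving Euclidean distances."""
    n = len(points)
    sq = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        value, vec = top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i].append(scale * vec[i])
        # Deflate so the next pass finds the next-largest eigenpair.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return coords

# Toy 3-dimensional "word vectors" (made up) that happen to lie in a plane.
toy_vectors = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
coords = classical_mds(toy_vectors, dims=2)
for point in coords:
    print([round(x, 3) for x in point])
```

Because the toy points lie in a plane, the two-dimensional map reproduces their pairwise distances; with real 100-dimensional vectors the distances are only approximated.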

I expected terms close to each other in this reduced vector space to be semantically similar, meaning they are commonly found within the same context and are interchangeable within the corpus. There is one clear trend in the figures: the terms seem to be stratified primarily by frequency.

Specifically:

  • Higher-frequency terms are more separated and isolated. Each of these terms is placed in a location within the multidimensional space that reflects its distinct meaning.
  • Low-frequency terms tend to aggregate around each other, often overlapping to show how close they are. The positions of these terms cannot be determined as precisely, so their encodings tend to settle close to each other without much differentiation. Sometimes we could not find them in the plot at all.

Here is a summary of my top findings from our study of the most common phrases in the job openings for data analysts (the keywords cited come from the earlier exercise):

  1. business and business-related analysis are important for a data analyst: (sometimes) as a driver of the business, a data analyst should have business acumen and the ability to derive value for the business
  2. years, experience and years_experience are vital in the job description of a data analyst, highlighting the emphasis on employment history and experience
  3. SQL is important
  4. Also important is data treatment, ranging from security (privacy and security) to protection, integrity and compliance
  5. Understanding the requirements is another important aspect of the job for any budding data analyst
  6. A data analyst should be able to handle complex and large data sets
  7. Knowledge and thorough understanding of the operations and processes of a business (relating to the first point)
  8. Teamwork is important, as analysts often play a supportive role and are expected to work in teams

 
