Blog entry by Srishti Srishti

Employee Churn Rate Prediction: Employee retention using data analytics

HR analytics with machine learning: classification and regression trees applied to a company’s HR data. The Great Resignation, an economic trend triggered by the COVID-19 pandemic, has changed the relationship between offices and workers. This article explains how HR analytics can help companies respond to this trend.

People are expected to give their all – labor, passion, and time – to their jobs. But if their jobs don’t give back enough, they will leave, as have the 4.5 million burned-out American employees who quit their jobs since November 2021 due to low satisfaction.

HR analytics refers to the collection and analysis of employee data and the reporting of actionable insights. Information from HR analytics can be used to:

  1. generalize standards for working conditions to avoid burnout
  2. assign projects that align with employees’ strengths for better performance 
  3. launch initiatives that align with career aspirations for higher satisfaction
  4. evaluate performance to uncover sources of talent

So, corporations are using data to retain talented employees, increase employee satisfaction, boost company loyalty, and reduce hiring and retention costs.

Classification and regression trees (CART) enable companies to characterize loyalty and identify who is likely to resign. They also reveal the conditions that affect employees' loyalty or leave them dissatisfied.

When you perform CART, you can identify two paths: what makes an employee loyal, and what makes an employee leave. Each path is defined by a set of attributes: one set leads to a greater sense of loyalty, the other to higher dissatisfaction.

Then, these attributes are ranked in order of importance to show which has the greater influence on an employee's decision to stay or to leave. Several HR analytics solutions are available on the market, but here we apply the CART algorithm using the R programming language.

The data set used here is simulated and contains several measures that can be used to predict which employees are at risk of leaving the company. The CART analysis yields actionable insights in the following steps:

  1. Business case
  2. Data exploration and preparation
  3. Split data into training and validation
  4. Develop an initial model and interpret two complete paths
  5. Identify important variables

You can follow along with the steps in the accompanying notebook to run the analysis on your own device.

  1. Business case

In this case study, we will visualize two paths of attributes that affect loyalty and dissatisfaction among employees. The business case is formed around the question: Can we predict those employees who are likely to leave the organization?

  2. Data exploration and preparation

The data set covers 14,999 employees and contains eight continuous variables and two categorical variables. Continuous variables hold numerical values, while categorical variables group observations under named categories. For example, “department” can take values such as sales, marketing, consumer, operations, and so on.

 The variables are explained in the data dictionary below:

  1. satisfaction_level: Satisfaction rating an employee gives to their job
  2. last_evaluation: Rating between 0 and 1, received by an employee from their employer for job performance during the last evaluation
  3. number_project: Number of projects an employee is involved in
  4. average_monthly_hours: The average number of hours per month an employee spends at the office
  5. time_spend_company: Number of years spent in the company
  6. work_accident: 0 = no accident during the employee's stay, 1 = accident during the employee's stay
  7. promotion_last_5years: 0 = no promotion in the last 5 years, 1 = promoted in the last 5 years
  8. resigned: 0 = the employee stayed with the company, 1 = the employee resigned from the company
  9. salary_grade: Salary level of an employee (low, medium, or high)
  10. department: The department to which an employee belongs

 We will plot the variables in order to explore:

[Figure: exploratory plots of the continuous variables]

  • Satisfaction level: Most employees are highly satisfied.
  • Last evaluation: Most employees are good performers, with 75% of the data set being evaluated between 56% and 87%.
  • Number of projects: Most employees handle a reasonable number of projects.
  • Average monthly hours: Most employees spend a fairly high number of hours at work.
  • Time spent in company: Few employees stay beyond 4 years.

Let us take a closer look at the binary variables among these: work_accident, resigned, and promotion_last_5years.

Frequency of accidents at work:

  • Most employees (85.5%) did not have an accident

 Frequency of resignations:

  • Most employees (76.2%) stayed with the organization and did not resign.

Frequency of promotions in the last 5 years:

  • Most employees (97.9%) did not receive a promotion in the last 5 years.

 Exploring categorical variables: salary_grade, and department.

 Salary grade of employees:

  • 8.2% of employees form the top level with the highest pay, 42.9% are paid a medium salary, and 48.7% are paid a low salary.

 Number of employees in each department:

  • The sales department has the most employees (27%), and management has the fewest (4.2%).
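
The percentages in this section are simple frequency shares. As a minimal sketch of how such a share can be computed, here is a small Python example using only the standard library. The article's own analysis is done in R; the function name `share_by_category` and the toy column below are assumptions made for illustration and are not taken from the data set.

```python
from collections import Counter


def share_by_category(values):
    """Return {category: percentage of rows}, rounded to one decimal place."""
    counts = Counter(values)
    total = len(values)
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


# Toy column invented for illustration: 1 = resigned, 0 = stayed.
resigned = [0, 0, 0, 1, 0, 1, 0, 0]
shares = share_by_category(resigned)
print(shares)  # {0: 75.0, 1: 25.0}
```

The same function works for string categories such as department or salary grade.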

  3. Split data into training and validation:

We will split the data into two parts, training and validation, but first let’s understand why. We train humans to perform a skill, and in a similar way we can train an algorithm. A human practices to perfect an ability; an algorithm learns from the data we feed it.

The algorithm identifies the pattern in the data, learns the intricacies and nuances of that pattern to build an ability to predict accurately. Therefore, we split our dataset so that we can test the trained model on a representative dataset where we already know the correct predictions. This will let us know how well the model that we trained is performing.
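
To make the idea concrete, here is a minimal Python sketch of a random split and of the accuracy check that the validation part makes possible. It is an illustration only: the article performs this step in R, and the 70/30 ratio, the seed, and the function names `train_validation_split` and `accuracy` are assumptions that do not come from the article.

```python
import random


def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into (train, validation)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Share of validation rows where the prediction matches the known answer."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


rows = list(range(10))  # stand-in for 10 employee records
train, validation = train_validation_split(rows)
print(len(train), len(validation))  # 7 3

# Hypothetical predictions compared with known outcomes (1 = resigned).
print(accuracy([0, 1, 0, 0], [0, 1, 1, 0]))  # 0.75
```

Because the validation rows are never shown to the model during training, the accuracy measured on them is a fairer estimate of how the model behaves on new employees.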

But before we train the model, we will create factors of the following variables:

  1. Department: The department to which an employee belongs. There are 10 departments in total. Sales has the most employees (27%) and management the fewest (4.2%).

  2. Salary grade: The salary level, recorded as low, medium, or high. 8.25% of employees are at the top level with the highest pay, 42.9% are paid a medium salary, and 48.7% are paid a low salary.

  3. Resigned: 0 denotes employees who stayed and 1 denotes employees who resigned from the organization.

We create factors when we want each type within a variable to be treated as a category. For example, in R’s memory, factorizing the variable ‘salary_grade’ means treating ‘low,’ ‘medium,’ and ‘high’ as individual categories. This ensures that the modeling functions treat each type correctly.
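
For readers who think in Python, the following standard-library sketch mimics what a factor stores: a list of levels plus, for each row, a code pointing at one of those levels. The helper `to_factor` is a hypothetical name introduced here and is not part of R or of the article's notebook.

```python
def to_factor(values, levels=None):
    """Return (codes, levels): each value is replaced by the index of its level."""
    if levels is None:
        levels = sorted(set(values))  # default: alphabetical levels
    index = {level: i for i, level in enumerate(levels)}
    return [index[v] for v in values], list(levels)


# Toy column invented for illustration.
salary_grade = ["low", "high", "medium", "low"]
codes, levels = to_factor(salary_grade, levels=["low", "medium", "high"])
print(levels)  # ['low', 'medium', 'high']
print(codes)   # [0, 2, 1, 0]
```

The codes are labels, not quantities: a modeling function that knows the column is categorical will split on category membership rather than treating 2 as "twice" 1.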

  4. Develop an initial model

The initial model is developed on the training data set.

[Figure: initial CART model built on the training data set]

How to read the tree:

  • 1 denotes 'resigned,' and 0 denotes 'stayed'
  • At the top, where no condition has yet been applied to the training data set (train), the best guess is 0 (stayed)
  • Of the total observations, 76% did not leave and 24% left
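
The "best guess" at the top of the tree is simply the most common class among the observations in that node. A minimal Python sketch of that calculation follows. The function name `node_summary` is an assumption, and the toy labels are constructed only to mirror the 76% / 24% proportions quoted above; they are not the article's data.

```python
from collections import Counter


def node_summary(labels):
    """Return (best_guess, proportions) for the observations in one tree node."""
    counts = Counter(labels)
    total = len(labels)
    best_guess = counts.most_common(1)[0][0]  # majority class
    proportions = {cls: round(n / total, 2) for cls, n in sorted(counts.items())}
    return best_guess, proportions


# 100 toy labels: 76 stayed (0) and 24 resigned (1).
labels = [0] * 76 + [1] * 24
best_guess, proportions = node_summary(labels)
print(best_guess)   # 0
print(proportions)  # {0: 0.76, 1: 0.24}
```

Each node further down the tree can be summarized the same way, using only the observations that satisfy the conditions on the path leading to it.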

Interpreting Two Complete Paths:

Path 1: Will Not Leave (loyal)

  • first condition: satisfaction level >= 47%
  • second condition: time_spend_company < 5 years
  • third condition: last_evaluation < 81%

Hence, those who did NOT leave are satisfied (satisfaction level of at least 47%), have spent fewer than 5 years in the organization, and received a last evaluation below 81%.

Path 2: Will Leave (resign)

  • first condition: satisfaction_level < 47%
  • second condition: number_project >= 3 projects
  • third condition: last_evaluation >= 58%

Hence, those who leave have low or moderate satisfaction, carry a workload of 3 or more projects, and received a last evaluation of at least 58%.
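
Each path is a chain of conditions joined by "and", so it can be written as a plain rule. The Python sketch below encodes the two paths using the thresholds listed above. The function names and the two employee records are hypothetical and are used only for illustration.

```python
def on_loyal_path(employee):
    """Path 1: all three 'will not leave' conditions hold."""
    return (
        employee["satisfaction_level"] >= 0.47
        and employee["time_spend_company"] < 5
        and employee["last_evaluation"] < 0.81
    )


def on_resign_path(employee):
    """Path 2: all three 'will leave' conditions hold."""
    return (
        employee["satisfaction_level"] < 0.47
        and employee["number_project"] >= 3
        and employee["last_evaluation"] >= 0.58
    )


# Hypothetical employees invented for illustration.
alex = {"satisfaction_level": 0.80, "time_spend_company": 3,
        "last_evaluation": 0.70, "number_project": 4}
sam = {"satisfaction_level": 0.30, "time_spend_company": 4,
       "last_evaluation": 0.90, "number_project": 5}

print(on_loyal_path(alex), on_resign_path(alex))  # True False
print(on_loyal_path(sam), on_resign_path(sam))    # False True
```

An employee can match neither rule, because the full tree contains other branches besides the two paths interpreted here.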

  5. Identify important variables

Characterizing loyalty:

11,428 employees, or 76% of the data set, are loyal. Three conditions that affect loyalty are:

  • a high level of satisfaction (satisfaction_level >= 47%)
  • fewer than 5 years spent in the organization (time_spend_company < 5 years)
  • a last evaluation below 81% (last_evaluation < 81%)

Characterizing resignation:

3,571 employees, or 24% of the data set, left. Three conditions that affect ‘resigned’ are:

  • low or moderate satisfaction (satisfaction_level < 47%)
  • a workload of 3 or more projects (number_project >= 3 projects)
  • a last evaluation of at least 58% (last_evaluation >= 58%)
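
One simplified way to see why some variables rank above others is to ask how much a single split on each variable reduces class impurity. The Python sketch below ranks variables by their best Gini decrease. It is an assumption-laden illustration, not the importance measure computed in the article's R notebook: Gini impurity is a criterion commonly used with CART, the function names are invented here, and the six toy rows are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, larger when classes are mixed."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_gain(rows, labels, feature):
    """Largest Gini decrease over all 'feature >= threshold' splits."""
    parent = gini(labels)
    best = 0.0
    for threshold in sorted({row[feature] for row in rows}):
        above = [y for row, y in zip(rows, labels) if row[feature] >= threshold]
        below = [y for row, y in zip(rows, labels) if row[feature] < threshold]
        if not above or not below:
            continue  # not a real split
        weighted = (len(above) * gini(above) + len(below) * gini(below)) / len(labels)
        best = max(best, parent - weighted)
    return best


def rank_features(rows, labels, features):
    """Sort features from most to least useful single split."""
    gains = {feature: best_gain(rows, labels, feature) for feature in features}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Toy data invented for illustration: 1 = resigned, 0 = stayed.
rows = [
    {"satisfaction_level": 0.9, "number_project": 3},
    {"satisfaction_level": 0.8, "number_project": 4},
    {"satisfaction_level": 0.7, "number_project": 3},
    {"satisfaction_level": 0.3, "number_project": 4},
    {"satisfaction_level": 0.2, "number_project": 3},
    {"satisfaction_level": 0.6, "number_project": 4},
]
labels = [0, 0, 0, 1, 1, 0]

ranking = rank_features(rows, labels, ["satisfaction_level", "number_project"])
print(ranking[0][0])  # satisfaction_level
```

In this toy data a split on satisfaction level separates the two classes perfectly, while a split on the number of projects does not help at all, so satisfaction level ranks first.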

HR analytics, the province of a few leading companies a decade ago, is now widely applied by growing businesses to uncover surprising sources of talent and counterintuitive insights about what drives employees to be loyal to their organization. We hope this encourages you to leverage the power of HR analytics to retain talent and save hiring costs. You can follow along with the steps in the accompanying notebook to run the analysis on your own device.