Goal
Diversity, unconscious bias in the workplace and, more broadly, the way companies treat their employees are important topics.
Data science can help uncover potential discrimination by examining the data to see whether some segments of employees are treated worse than others.
Challenge Description
There has been a lot of discussion about diversity in the workplace, especially in technology. The Head of HR at your company is concerned about this and has asked you to analyze internal employee data to see whether the results suggest that the company is treating all its employees fairly.
Specifically, she gave you the following tasks:
- The company has 6 levels. Identify the level of each employee.
- How many people does each employee manage? If John directly manages 2 people and each of them manages 5 people, then John manages 12 people in total.
- Build a model to predict the salary of each employee.
- Describe the main factors impacting employee salaries. Do you think the company has been treating all its employees fairly? What are the next steps you would suggest to the Head of HR?
PS: You can assume the data for this challenge is clean (e.g. no typos, no mismatches when performing joins).
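The first two tasks come down to walking the reporting tree defined by employee_id and boss_id. The sketch below is one possible approach, not necessarily the one used in this project. It runs on a toy hierarchy with hypothetical IDs that mirrors the John example, assumes the person at the top has a missing boss_id, and numbers levels from 1 (top) downward instead of naming them.

```python
from collections import defaultdict, deque

import pandas as pd

# Toy hierarchy (hypothetical IDs, not the real dataset):
# employee 1 manages 2 and 3, who manage 5 people each.
# Assumption: the top of the hierarchy has a missing boss_id.
company = pd.DataFrame({
    "employee_id": list(range(1, 14)),
    "boss_id": [None, 1, 1] + [2] * 5 + [3] * 5,
})

# Map each boss to the list of their direct reports.
reports = defaultdict(list)
roots = []
for emp, boss in zip(company["employee_id"], company["boss_id"]):
    if pd.isna(boss):
        roots.append(int(emp))
    else:
        reports[int(boss)].append(int(emp))

# Breadth-first traversal from the top: depth in the tree is the level.
level = {}
order = []  # employees in top-down order
queue = deque((root, 1) for root in roots)
while queue:
    emp, depth = queue.popleft()
    level[emp] = depth
    order.append(emp)
    for child in reports[emp]:
        queue.append((child, depth + 1))

# Walk bottom-up so every report is counted before their boss:
# each report adds 1 plus everyone that report manages.
managed = {emp: 0 for emp in order}
for emp in reversed(order):
    for child in reports[emp]:
        managed[emp] += 1 + managed[child]

company["level"] = company["employee_id"].map(level)
company["n_managed"] = company["employee_id"].map(managed)
print(company.head())
```

On this toy input, employee 1 ends up with n_managed equal to 12, matching the John example. The traversal is iterative, so a deep hierarchy does not run into Python's recursion limit.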
Data
| | employee_id | boss_id | dept |
|---|---|---|---|
| 0 | 46456 | 175361 | sales |
| 1 | 104708 | 29733 | HR |
| 2 | 120853 | 41991 | sales |
| 3 | 142630 | 171266 | HR |
| 4 | 72711 | 198240 | sales |
| | employee_id | signing_bonus | salary | degree_level | sex | yrs_experience |
|---|---|---|---|---|---|---|
| 0 | 138719 | 0 | 273000.0 | Master | M | 2 |
| 1 | 3192 | 0 | 301000.0 | Bachelor | F | 1 |
| 2 | 114657 | 0 | 261000.0 | Master | F | 2 |
| 3 | 29039 | 0 | 86000.0 | High_School | F | 4 |
| 4 | 118607 | 0 | 126000.0 | Bachelor | F | 3 |
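The two tables share the employee_id column, so they can be combined into a single frame before any analysis. A minimal sketch with pandas follows. The rows are made-up toy values (the real datasets are not provided here), and the variable names company and employee are assumptions.

```python
import pandas as pd

# Toy rows with the same columns as the two tables above
# (hypothetical values, not taken from the real data).
company = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "boss_id": [None, 101, 101],
    "dept": ["sales", "sales", "HR"],
})
employee = pd.DataFrame({
    "employee_id": [103, 101, 102],
    "signing_bonus": [0, 1, 0],
    "salary": [90000.0, 250000.0, 120000.0],
    "degree_level": ["Bachelor", "Master", "High_School"],
    "sex": ["F", "M", "F"],
    "yrs_experience": [3, 10, 5],
})

# validate="one_to_one" makes pandas raise if employee_id is
# duplicated on either side, which checks the "clean data" assumption.
merged = company.merge(employee, on="employee_id", how="inner",
                       validate="one_to_one")

# With clean data, no employee should be lost by the inner join.
assert len(merged) == len(company)
print(merged)
```

An inner join is used here because the challenge states there are no mismatches; with messier data, how="outer" together with indicator=True would show which rows fail to match.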
Skills Covered
- Data Wrangling (Pandas)
- Data Visualization (Seaborn, Matplotlib, etc.)
- Machine Learning (Sklearn, RandomForest, etc.)
- Insight Extraction (Feature Importance, PDP Plot)
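As an illustration of the modeling and insight-extraction steps, the sketch below fits a scikit-learn RandomForestRegressor and ranks features by importance. It is a hedged example, not the project's actual pipeline: the data is synthetic, salary depends only on yrs_experience by construction, and the hyperparameters are arbitrary, so nothing here says anything about the real company.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged employee table (assumption:
# column names follow the tables in the Data section).
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "yrs_experience": rng.integers(1, 30, size=n),
    "signing_bonus": rng.integers(0, 2, size=n),
    "sex": rng.choice(["M", "F"], size=n),
    "degree_level": rng.choice(["High_School", "Bachelor", "Master"], size=n),
    "dept": rng.choice(["sales", "HR"], size=n),
})
# By construction, salary is driven by experience plus noise only.
data["salary"] = (80_000 + 5_000 * data["yrs_experience"]
                  + rng.normal(0, 10_000, size=n))

# One-hot encode the categorical columns for the forest.
X = pd.get_dummies(data.drop(columns="salary"),
                   columns=["sex", "degree_level", "dept"])
y = data["salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# R^2 on held-out data as a quick sanity check of the fit.
r2 = model.score(X_test, y_test)

# Impurity-based importances, one per encoded column, summing to 1.
importances = pd.Series(model.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(f"R^2 on test set: {r2:.2f}")
print(importances)
```

If a protected attribute such as sex showed a sizeable importance after controlling for level, experience, and department, that would be a signal worth investigating further, for example with a partial dependence plot. Impurity-based importances can be biased toward high-cardinality features, so permutation importance is a reasonable cross-check.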
Interesting Findings
Footnote
This project is my solution to one of the Data Science Challenges by Giulio Palombo. The datasets are not provided here.