Data Accuracy
To check for inaccurate values in the datasets, bar graphs were plotted for each variable to reveal any out-of-range values:


The bar graphs for both years show that all of the selected questions take values on the 1-5 scale, and that the two categorical columns, Gender and classification level, also fall within their defined ranges. There is therefore no accuracy issue in either dataset.
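The same range check can also be done programmatically rather than by eye. The following is a minimal sketch, not the procedure used in this analysis: the column names (`Q1`, `Gender`), the category codes, and the sample rows are all hypothetical, since the actual codings are not listed here.

```python
# Hypothetical allowed values per column; the real column names and
# category codes are assumptions made for illustration only.
VALID_VALUES = {
    "Q1": {1, 2, 3, 4, 5},   # a 1-5 level question
    "Gender": {1, 2},        # assumed coding of a categorical column
}


def find_out_of_range(rows, valid_values):
    """Return (row_index, column, value) for every value outside its allowed set.

    A missing column yields None, which is also reported as out of range.
    """
    problems = []
    for i, row in enumerate(rows):
        for column, allowed in valid_values.items():
            value = row.get(column)
            if value not in allowed:
                problems.append((i, column, value))
    return problems


# Made-up responses, one dict per respondent.
rows = [
    {"Q1": 3, "Gender": 1},
    {"Q1": 7, "Gender": 2},  # 7 lies outside the 1-5 scale
    {"Q1": 5, "Gender": 1},
]

issues = find_out_of_range(rows, VALID_VALUES)
print(issues)
```

An empty result corresponds to the conclusion drawn from the bar graphs: every value lies within its defined range.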
Outliers
To check for outliers, box plots were plotted for both years, 2014 and 2019:


The box plots show that some outliers exist in both the 2014 and 2019 datasets.
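For reference, box plots conventionally mark as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below illustrates that rule on made-up responses for a single 1-5 question. It assumes the conventional 1.5 multiplier and the quartile method of Python's `statistics.quantiles`; the plotting tool used for the figures may compute quartiles slightly differently.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying outside the conventional box-plot fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up answers to one question; most respondents chose 4.
answers = [3, 4, 4, 4, 4, 4, 4, 4, 5, 1]
flagged = boxplot_outliers(answers)
print(sorted(flagged))
```

Because responses on a 1-5 scale cluster on a few values, the interquartile range can be very narrow, so a per-variable box plot may flag many responses that are merely uncommon.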
Handling Outliers
The Mahalanobis distance test was used to detect and remove outliers from both datasets.
The Mahalanobis distance measures how far each response lies from the centre of the data, taking the correlations between variables into account. Responses whose distance had a p-value less than 0.001 were removed.
After removing the outliers, we obtained cleaned datasets for both years.
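The Mahalanobis screening described above can be sketched as follows. This is an illustration under stated assumptions, not the exact code used for this analysis: it assumes the p-value comes from comparing the squared distance with a chi-square distribution whose degrees of freedom equal the number of variables (the usual convention), and it runs on synthetic data rather than the survey responses. The function name `mahalanobis_screen` is hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_screen(X, alpha=0.001):
    """Return (keep_mask, squared_distances, p_values) for the rows of X.

    Assumption: squared distances are referred to a chi-square
    distribution with df equal to the number of columns.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)          # deviation from the centre
    cov = np.cov(X, rowvar=False)          # sample covariance of the columns
    inv_cov = np.linalg.pinv(cov)          # pseudo-inverse tolerates collinearity
    # Squared Mahalanobis distance of each row: d_i^2 = c_i^T S^-1 c_i
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    p_values = chi2.sf(d2, df=X.shape[1])  # upper-tail probability
    keep = p_values >= alpha               # rows with p < alpha are dropped
    return keep, d2, p_values


# Synthetic stand-in data: 200 typical rows plus one extreme row.
rng = np.random.default_rng(0)
typical = rng.normal(size=(200, 3))
extreme = np.array([[50.0, -50.0, 50.0]])
data = np.vstack([typical, extreme])

keep, d2, p_values = mahalanobis_screen(data)
cleaned = data[keep]
print(f"removed {np.count_nonzero(~keep)} of {len(data)} rows")
```

The strict 0.001 cutoff removes only responses that are very far from the centre, which limits the amount of data discarded.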
