Missing Data

Before taking any action, we first tested whether these values are missing completely at random (MCAR). Using an MCAR test in R, we set up the following hypotheses:

H0: Data is missing completely at random

Ha: Data is not missing completely at random

The p-value is 1 in both datasets, so we fail to reject the null hypothesis. We therefore treat the missing values as missing completely at random.

Handling Missing Data

The following approach was used to deal with the missing values:

Responses from participants who did not answer the questions about their gender and/or their classification level were removed, as we chose not to carry those responses forward. This step removed 0.01 of the total responses in 2014 and only 4 rows in 2019.

Then, we examined the percentage of missing values in each row of both datasets. Rows with more than 5% missing values were removed from the datasets.
The remaining responses, with at most 5% missing values, were imputed with “mice”, and the missing values were replaced with the median.
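
The row-level steps above can be sketched as follows. This is a minimal illustration in Python, not the R code used for the analysis, and it does not reproduce what “mice” does internally. It only shows the logic of dropping incomplete rows and filling the remaining gaps with a column median. The function name `clean_responses`, the use of `None` to mark a missing answer, and the assumption that every row has the same columns are ours.

```python
from statistics import median


def clean_responses(rows, key_fields, max_missing=0.05):
    """Return cleaned copies of `rows` (a list of dicts; None marks a missing answer).

    Hypothetical helper: assumes every row has the same columns and that
    columns with gaps hold numeric answers.
    """
    # Step 1: drop responses that lack any key field (e.g. gender, classification).
    kept = [dict(r) for r in rows if all(r.get(k) is not None for k in key_fields)]

    # Step 2: drop responses whose share of missing values exceeds the threshold.
    def missing_share(row):
        return sum(v is None for v in row.values()) / len(row)

    kept = [r for r in kept if missing_share(r) <= max_missing]

    # Step 3: fill each remaining gap with the median of the observed values
    # in the same column. Columns without gaps are left untouched, and a column
    # with no observed values at all is skipped (its gaps stay as None).
    columns = list(kept[0]) if kept else []
    for col in columns:
        observed = [r[col] for r in kept if r[col] is not None]
        if len(observed) == len(kept) or not observed:
            continue
        fill = median(observed)
        for r in kept:
            if r[col] is None:
                r[col] = fill
    return kept
```

A call such as `clean_responses(rows, ["gender", "level"])`, where the two field names are placeholders for the gender and classification columns, returns only the retained rows with their gaps filled. Note that with a 5% threshold a row needs at least 20 columns before a single missing answer is tolerated.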

After performing the above steps, no missing data remain in either year's dataset.
