College student mental health has been receiving increasing attention. Although many solutions have been proposed, the problem is not easy to solve. In this project, we seek to build a prediction system that can tell when students are likely to become stressed, and thus make possible interventions that help college students maintain a healthy mental condition. We also expect this project to be a starting point for better understanding college students' stress patterns.
In this project we use the StudentLife dataset, which was collected by a research group at Dartmouth College. The dataset contains a wide variety of data, including sensor data, EMA (ecological momentary assessment) data, survey responses, and educational data. Among this large amount of data, we mainly focus on data concerning students' mental condition, such as Stress Level, Enthusiasm, and Calm, and on their daily behavioral data, including Sleeping Hours, Working Hours, Exercise, and Social. After diving into the data, we can summarize the data shape as follows.
Since our dataset contains a quite limited number of records (only 60 students were involved in the study), it would be very hard to use these 60 students as a sample to predict other students' stress levels. Therefore, we decided to shift the focus of our study from predicting the stress level of each individual student to predicting the overall trend of students' stress level as a whole.
First, we aggregate the stress values and average them over the number of participants. Next, we analyze how the average stress value changes over time. Then we use Granger causality analysis to detect causal relationships between stress and other features, because we are interested in how students' stress patterns form in general and what they lead to. We hope to identify the features that have the most significant impact on stress level.
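The averaging step can be sketched as follows. This is a minimal illustration, not the original pipeline: the `(student_id, day, stress)` record layout, the function name `daily_mean_stress`, and the choice to divide by the number of students who reported on a given day are all assumptions.

```python
from collections import defaultdict

def daily_mean_stress(records):
    """Average stress over all participants who reported on each day."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _student_id, day, stress in records:
        totals[day] += stress
        counts[day] += 1
    # One averaged value per day, in chronological order.
    return {day: totals[day] / counts[day] for day in sorted(totals)}

# Hypothetical records: (student_id, day, stress)
records = [
    ("u00", 1, 2.0), ("u01", 1, 4.0),
    ("u00", 2, 3.0), ("u01", 2, 3.0), ("u02", 2, 6.0),
]
series = daily_mean_stress(records)
print(series)
```

The resulting day-indexed series is the kind of input the later time series steps operate on.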
Meanwhile, we build an autoregressive (AR) model to predict students' stress level from their historical stress data. To predict student stress more accurately, we also use the autocorrelation function (ACF) to measure the correlation between stress values at different lags of the time series.
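For reference, the two tools can be written in standard textbook form. The symbols are our own notation and do not appear in the original analysis: $y_t$ is the averaged stress value at time $t$, $p$ the number of lags, $\phi_i$ the AR coefficients, $c$ a constant, $\varepsilon_t$ the error term, and $\bar{y}$ the sample mean over $n$ observations.

```latex
% AR(p) model: the current value is a linear function of the p previous values
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t

% Sample autocorrelation at lag k
\rho(k) = \frac{\sum_{t=k+1}^{n} (y_t - \bar{y})(y_{t-k} - \bar{y})}
               {\sum_{t=1}^{n} (y_t - \bar{y})^2}
```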
Granger Causality Analysis
Before running the analysis, we were curious whether an increase in stress level affects students' sleeping hours or, the other way round, whether sleeping hours influence students' stress level. The result of our Granger causality analysis suggests that it is in fact students' stress level that influences their sleeping hours, with p-value = 0.0547 at lag = 5, which is marginally above the conventional 0.05 threshold. In comparison, the p-value is 0.182 under the hypothesis that sleeping hours affect student stress level, so that direction is not supported. Our results also show that students' stress level Granger-causes exercise; in other words, students' workout hours increase soon after they get stressed. Working hours appear to be the leading factor in college students' stress level. Surprisingly, as college students get more stressed, their enthusiasm scale also goes up, and so does their calmness scale.
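The idea behind the test can be sketched in plain Python. The sketch compares a regression of the effect series on its own lags with one that also includes lags of the candidate cause, and returns the resulting F-statistic. It is an illustration under our own assumptions, not the code used for the results above: the function names and the synthetic data are hypothetical, and converting the statistic into a p-value needs an F-distribution, which is omitted here.

```python
import random

def ols_rss(X, y):
    """Residual sum of squares of the least-squares fit y ~ X (normal equations)."""
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def granger_f(cause, effect, lag):
    """F-statistic for H0: lagged `cause` adds nothing to predicting `effect`."""
    rows_r, rows_u, target = [], [], []
    for t in range(lag, len(effect)):
        own = [effect[t - i] for i in range(1, lag + 1)]
        other = [cause[t - i] for i in range(1, lag + 1)]
        rows_r.append([1.0] + own)          # restricted: own lags only
        rows_u.append([1.0] + own + other)  # unrestricted: plus lags of cause
        target.append(effect[t])
    rss_r, rss_u = ols_rss(rows_r, target), ols_rss(rows_u, target)
    dof = len(target) - (2 * lag + 1)
    return ((rss_r - rss_u) / lag) / (rss_u / dof)

# Synthetic data in which x drives y with a one-step delay.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + random.gauss(0, 0.5) for t in range(1, 200)]
f_xy = granger_f(x, y, lag=2)  # does x help predict y?
f_yx = granger_f(y, x, lag=2)  # does y help predict x?
print(f_xy, f_yx)
```

Running the test in both directions, as in the last two calls, mirrors the stress versus sleeping hours comparison: a large statistic in one direction and a small one in the other indicates which series leads. Note that Granger causality only measures predictive precedence, not a causal mechanism.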
Prediction on student stress
Stability & Auto-correlation testing
We first calculated the moving average of student stress level with a sliding window of size 3, which creates a smoothed version of the original data. Students' stress level appears to peak around the middle of the survey period and to reach another peak almost at the end of the survey period.
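A minimal sketch of the smoothing step, assuming a trailing window (each smoothed point averages the current value and the two before it). The function name and the sample values are hypothetical.

```python
def moving_average(values, window=3):
    """Trailing moving average; the first window-1 points produce no output."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

# Hypothetical averaged stress values on five consecutive days.
stress = [2.0, 3.0, 4.0, 5.0, 3.0]
smoothed = moving_average(stress, window=3)
print(smoothed)
```

A larger window gives a smoother curve but shortens the series and blurs short peaks, so a small window such as 3 is a reasonable choice for a short survey period.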
The rolling mean and rolling standard deviation of the time series look much smoother than the original stress time series (ts stress). However, it is hard to tell how stable the time series actually is from visual observation alone. To test the stability (stationarity) of the stress data, we conducted a Dickey-Fuller test on its moving average differences and obtained a small p-value of 0.024. Since this is below 0.05, we reject the null hypothesis of a unit root and conclude that ts stress is mostly stable.
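The statistic behind the test can be sketched as below. This is the basic (non-augmented) Dickey-Fuller regression with a constant, which may differ from the variant used for the result above; the function name and the synthetic series are assumptions. The statistic is compared with tabulated critical values (about -2.86 at the 5% level for large samples, and more negative for short series) rather than with the usual t-distribution.

```python
import math
import random

def dickey_fuller_stat(series):
    """t-statistic of gamma in: diff(y)_t = alpha + gamma * y_{t-1} + error."""
    prev = series[:-1]
    diff = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(diff)
    mean_p, mean_d = sum(prev) / n, sum(diff) / n
    sxx = sum((p - mean_p) ** 2 for p in prev)
    sxy = sum((p - mean_p) * (d - mean_d) for p, d in zip(prev, diff))
    gamma = sxy / sxx
    alpha = mean_d - gamma * mean_p
    rss = sum((d - alpha - gamma * p) ** 2 for p, d in zip(prev, diff))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return gamma / std_err

# Synthetic examples: white noise is stationary, a random walk is not.
random.seed(2)
noise = [random.gauss(0, 1) for _ in range(300)]
walk = [0.0]
for _ in range(299):
    walk.append(walk[-1] + random.gauss(0, 1))

stat_noise = dickey_fuller_stat(noise)
stat_walk = dickey_fuller_stat(walk)
print(stat_noise, stat_walk)
```

A strongly negative statistic, as for the white-noise series, is evidence against a unit root; a statistic near zero means the series cannot be distinguished from a random walk.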
We also used the autocorrelation function (ACF) to measure the correlation between ts stress and lagged copies of itself. For lags smaller than 7, the time series is positively correlated with itself.
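A minimal sketch of the sample ACF, using the common estimator that divides the lagged sum of products by the total sum of squared deviations. The function name and the toy input are hypothetical.

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    total = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / total
            for k in range(max_lag + 1)]

# Toy series; lag 0 is always 1 by construction.
r = acf([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
print(r)
```

The lag at which the autocorrelation stops being clearly positive is a common heuristic for choosing how many past values an autoregressive model should use.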
Fitted Models on student stress
We built an autoregressive (AR) model to fit the time series data, with the parameter setting lag = 7. Below are the fitted values (red line) compared with the original stress data (blue line); the AR model yields a residual sum of squares (RSS) of 17.383. The AR model appears to capture the overall trend of students' stress level.
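As an illustration of fitting an AR model with 7 lags and computing its RSS, the sketch below estimates the coefficients with the Yule-Walker equations, solved by the Levinson-Durbin recursion. The estimator is our choice for a compact, dependency-free example and may differ from the one behind the reported RSS; the function names and the synthetic series are assumptions.

```python
import random

def fit_ar_yule_walker(series, p):
    """Estimate AR(p) coefficients from sample autocovariances (Levinson-Durbin)."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    r = [sum(dev[t] * dev[t - k] for t in range(k, n)) / n for k in range(p + 1)]
    phi, err = [], r[0]
    for k in range(1, p + 1):
        acc = r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))
        refl = acc / err  # reflection coefficient = new last AR coefficient
        phi = [phi[j] - refl * phi[k - 2 - j] for j in range(k - 1)] + [refl]
        err *= 1 - refl ** 2
    return mean, phi

def residual_sum_of_squares(series, mean, phi):
    """RSS of one-step fitted values for t >= p, where p = len(phi)."""
    p = len(phi)
    rss = 0.0
    for t in range(p, len(series)):
        fitted = mean + sum(phi[i] * (series[t - 1 - i] - mean) for i in range(p))
        rss += (series[t] - fitted) ** 2
    return rss

# Synthetic AR(1) series standing in for the averaged stress data.
random.seed(1)
y = [0.0]
for _ in range(1999):
    y.append(0.6 * y[-1] + random.gauss(0, 1))

mean, phi = fit_ar_yule_walker(y, p=7)
rss = residual_sum_of_squares(y, mean, phi)
print(phi, rss)
```

Because the synthetic series depends only on its previous value, the first coefficient dominates and the remaining six stay close to zero, which shows how the fitted coefficients reveal the lags that matter.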
We also wanted to see whether other models would outperform the AR model, so we built a moving average (MA) model. The MA model turned out to have a higher RSS than the AR model, reaching 22.210. The plot also suggests that the MA model fails to capture some of the trends in ts stress.
When we applied an integrated approach that combines the moving average and autoregressive models, the result was even better, with the RSS reduced to 14.129. The fitted values (red line) of this model capture most of the trend features of the ts stress data.
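In textbook notation, a model that combines autoregressive and moving average terms, together with the RSS used to compare the fits, can be written as below. The symbols are ours, and the orders $p$ and $q$ used for the combined model are not stated here: $y_t$ is the stress value at time $t$, $\hat{y}_t$ its fitted value, $\phi_i$ and $\theta_j$ the AR and MA coefficients, $c$ a constant, and $\varepsilon_t$ the error term. An ARIMA model applies the same form after differencing the series.

```latex
% ARMA(p, q): p autoregressive terms plus q moving average terms
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}

% Residual sum of squares over the fitted observations
\mathrm{RSS} = \sum_{t} \left( y_t - \hat{y}_t \right)^2
```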
Model Prediction on student stress
Although the combined model achieved the lowest RSS on the fitted data, our results show that the AR model yields the best prediction performance of the three, with parameters set to lag = 1 and sliding window size = 2. The comparison results are as follows: AR (RMSE: 0.6704), ARIMA (RMSE: 1.0134), MA (RMSE: 1.0209).
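The RMSE comparison can be illustrated with a small sketch that fits an AR model with lag = 1 on a training segment and scores one-step-ahead forecasts on held-out values. The train/test split, the function names, and the toy numbers are assumptions made for illustration; they do not reproduce the evaluation protocol or the figures above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def ar1_one_step_forecasts(train, test):
    """Fit y_t = c + phi * y_{t-1} on train, then forecast each test point
    from the previously observed value."""
    x, y = train[:-1], train[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    phi = (sum((a - mx) * (b - my) for a, b in zip(x, y))
           / sum((a - mx) ** 2 for a in x))
    c = my - phi * mx
    history = [train[-1]] + list(test[:-1])
    return [c + phi * h for h in history]

# Toy data: the training segment rises by 1 each step, the last test point jumps.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 7.0, 9.0]
forecasts = ar1_one_step_forecasts(train, test)
score = rmse(test, forecasts)
print(forecasts, score)
```

Scoring on held-out values is what separates this comparison from the RSS of the fitted models: a model with more parameters can fit the observed series more closely and still forecast worse.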