---
title: "Does the suicide prevention profile predict suicide rates?"
author: "Julian Flowers"
date: "13/08/2017"
output:
  pdf_document:
    fig_caption: yes
    keep_tex: yes
    latex_engine: xelatex
    toc: yes
  html_document: default
  word_document:
    toc: yes
subtitle: Draft
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache = TRUE, warning = FALSE)
options(digits = 2)  # digits is not a knitr chunk option, so set it globally instead
# library(doMC)
# # parallel for vtreat
# ncores <- parallel::detectCores()
# parallelCluster <- parallel::makeCluster(ncores)
# # parallel for caret
# registerDoMC(cores=ncores)
```
# Introduction
## Suicide and suicide prevention
Suicide rates, particularly among men, have been increasing in recent years, and suicide prevention is a significant public health issue. As part of efforts to help tackle suicide, Public Health England published a [Suicide Prevention atlas](https://healthierlives.phe.org.uk/topic/suicide-prevention) and a [Fingertips suicide prevention profile](https://fingertips.phe.org.uk/profile-group/mental-health/profile/suicide/data#page/0). These resources present a set of summary statistics showing trends and variation in suicide rates across England, alongside a range of predictors and risk factors. PHE also published an accompanying [manual for suicide prevention](https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/585411/PHE_local_suicide_prevention_planning_practice_resource.pdf).
The guidance recommends a focus on eight areas:
1. Reducing risks in men, tackling issues like debt, social isolation, and drugs and alcohol
2. Preventing and responding to self-harm
3. Tackling mental health issues in children and young people
4. Treating depression in primary care
5. Safer acute mental health care
6. Targeting high-frequency locations
7. Reducing isolation through community support
8. Bereavement support
The data are presented in a number of ways.
* univariate analysis - i.e. single indicators showing variation and comparison with benchmarks (national rates or similar areas)
* as spine charts - which give an at-a-glance summary of indicator values across all the indicators
* as comparisons between indicators. The Fingertips profile enables users to draw scatter plots (see the example below) which summarise the relationships between indicators. In this case it shows that although suicide rates tend to be higher in areas with higher rates of depression, the relationship is weak.
![Source: Fingertips](images/depression-vs-suicide.png)
The suicide prevention guidance also includes a section on using data, including surveillance systems, audit and aggregate data (like the profiles), to aid understanding of the contributors to local suicide rates and to build local monitoring systems.
### Predictive modelling
One additional approach is the use of *predictive analytics* - this is the core of supervised machine learning^[In machine learning, supervision refers to data where we already know the values we are trying to predict]. This is multivariate analysis - using all the data to:
1. Build a statistical model which has the best predictive value for suicide rates
2. Aid understanding of which risks or explanatory factors are most important in explaining the variation in suicide rates
This kind of analysis can help us better understand the extent to which we might be able to reduce suicide rates, and explain the variation.
The most widely used algorithm is *multiple linear regression*, but other algorithms such as penalised regression, random forests and support vector machines are increasingly being used, as they sometimes have more predictive power and can identify the most important variables.
In this analysis we have tried to address the question: "do the indicators in the suicide prevention profile predict suicide rates?"
## Methods
There are a number of steps to the analysis:
* Obtain and clean the data (missing values, variable names)
* Select outcome measures
* Construct a testing script to perform the modelling
The code for obtaining the data, cleaning, and running the algorithms is available [here](insert link).
### Data
We downloaded the data for the suicide prevention profile from
[Fingertips](https://fingertips.phe.org.uk) via its *Application Programming Interface* (API), which can be readily accessed from the *R* statistical programming environment using the `fingertipsR` package.
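As an illustration, here is a minimal sketch of the download step (the full script lives in `scripts/get_sui_data.R`; looking the profile up by name rather than hard-coding its ID is an assumption, since profile IDs can change):
```{r, eval=FALSE}
library(fingertipsR)
## look up the suicide prevention profile by name rather than hard-coding its ID
profs <- profiles()
sui_id <- profs$ProfileID[grepl("Suicide", profs$ProfileName)][1]
## download every indicator in the profile
dataset <- fingertips_data(ProfileID = sui_id)
```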
### Analysis
All analysis was conducted with R software using the Classification and Regression Training (`caret`) package. This contains all the tools needed to split data into training and test sets, run a range of predictive algorithms, extract predictions and important variables, and assess model accuracy.
For the purpose of this analysis, we split the data into two parts at random - a training set and a test set. The training set consists of 70% of the data and is used to assess the various modelling algorithms and identify those with the best predictive value. These are then applied to the test data to see how well they perform on "unseen" data.
The algorithms we tested for this analysis are:
* linear regression
* glmnet
* random forest
* radial SVM (`svmRadial`)
* lasso regression
```{r libraries and loading script, warning=FALSE, message=FALSE}
library(fingertipsR)
library(tidyverse)
library(ggjoy)
library(WVPlots)
library(stringr)
library(caret)
library(caretEnsemble)
library(govstyle)
source("scripts/get_sui_data.R")
```
```{r download, cache = TRUE}
analysis_final <- analysis_final %>%
janitor::clean_names()
```
Downloading the Suicide prevention profile dataset yields a dataframe with `r nrow(dataset)` rows and `r ncol(dataset)` variables. There are `r length(unique(dataset$IndicatorID))` indicators in the dataset.
### Creating a modelling dataset
For the analysis we need to use the most recent data for each variable. Because of the very long variable names used in Fingertips, some simplification of the names is also needed. A sketch of the reshaping step is shown below.
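This sketch assumes the `dataset` object downloaded above; the exact filters used in `scripts/get_sui_data.R` may differ, for example in how sexes and area types are handled:
```{r, eval=FALSE}
## keep the most recent time period for each indicator and area,
## then spread to one column per indicator (one row per local authority)
analysis_wide <- dataset %>%
  group_by(IndicatorID, AreaName) %>%
  filter(TimeperiodSortable == max(TimeperiodSortable)) %>%
  ungroup() %>%
  select(IndicatorName, AreaName, Value) %>%
  tidyr::spread(IndicatorName, Value)
```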
## Exploratory analysis
### Variation in suicide rates
Men have much higher suicide rates than women, and male rates vary more between local authorities.
```{r variation, fig.cap = "Variation in male and female suicide rates by time period", message= FALSE, warning=FALSE}
library(viridis)
dataset %>%
filter(IndicatorID == 41001, !Sex == "Persons") %>%
select(AreaName, Timeperiod, TimeperiodSortable, Value, Sex) %>%
ggplot(aes(x = Value, y = reorder(Timeperiod, -TimeperiodSortable), fill = ..x..)) +
geom_joy_gradient(rel_min_height = .05, scale = 2, gradient_lwd = .5) +
theme_joy(font_size = 10) +
facet_wrap(~Sex, scales = "free") +
scale_fill_viridis(name = "", direction = -1, option = "C")+
scale_x_continuous(expand = c(0.01, 0)) +
labs(x = "Suicide rate",
y = "")
```
```{r, fig.cap = "Trends in national suicide rates by sex"}
dataset %>%
filter(IndicatorID == 41001) %>%
select(AreaName, Timeperiod, TimeperiodSortable, Value, Sex, IndicatorName) %>%
filter(AreaName == "England") %>%
ggplot(aes(Timeperiod, Value, color = Sex)) +
geom_line(aes(group = Sex)) +
geom_point() +
facet_wrap(~IndicatorName) +
labs(y = "Suicide rate",
x = "")
```
## Preparing the data
Missing data is a limitation of the dataset. Values are mainly missing because of the small-number suppression applied for disclosure control before data are released into the public domain. To prepare the data for analysis, missing values generally need to be removed or imputed. For this analysis we have taken the following steps, sketched in code after this list:
* Excluded variables with > 10% missing values
* Excluded outcome data with no value
* Imputed any remaining missing values in predictor variables to the mean value
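A minimal sketch of these rules, assuming a wide data frame `analysis_wide` with one numeric column per indicator and the outcome named `suicide_rate`:
```{r, eval=FALSE}
analysis_final <- analysis_wide %>%
  select_if(function(x) mean(is.na(x)) <= 0.1) %>%   # drop variables with > 10% missing
  filter(!is.na(suicide_rate)) %>%                   # drop areas with no outcome value
  mutate_if(is.numeric,
            function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))  # mean-impute the rest
```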
### Looking at variables
#### Summary statistics
```{r}
options(digits = 3)
summary_stats <- analysis_final %>%
select(suicide_rate) %>%
summarise(min = min(suicide_rate),
max = max(suicide_rate),
mean = mean(suicide_rate),
sd = sd(suicide_rate))
summary_stats %>%
knitr::kable(caption = "Summary statistics for suicide rates")
```
Suicide rates vary between `r summary_stats$min` and `r summary_stats$max` per 100,000 with a mean rate of `r summary_stats$mean`.
We can review the relationship between the variables in our analytical data set. This shows that the area suicide rate is not very strongly correlated with any of the variables in the profile and suggests that a model based on these variables may lack predictive power. There are clusters of associated variables, for example related to alcohol.
```{r, fig.cap="Correlation matrix of analysis variables", fig.height=6, fig.width=6}
cor_mat <- cor(analysis_final)  # named to avoid masking the base cor() function
corrplot::corrplot(cor_mat, tl.cex = .7, tl.col = "black", order = "hclust", number.cex = .4, type = "lower", method = "square")
corrr <- analysis_final %>%
  corrr::correlate()  # tidy correlation data frame
```
The next step is to build a model to try to predict suicide rates.
### Simple linear regression
```{r}
fmla <- suicide_rate ~ .
mod_lm <- lm(fmla, data = analysis_final)
lm_tidy <- broom::tidy(mod_lm) %>%
filter(p.value < 0.05)
lm_analysis <- broom::augment(mod_lm, analysis_final)
rsq <- mod_lm %>%
broom::glance() %>%
select(r.squared)
lm_rmse <- lm_analysis %>%
summarise(rmse = sqrt(mean(.resid^2)))
```
Model accuracy can be assessed in a number of ways:
1. The R squared value, which is a measure of the proportion of the variation in the data explained by the model. The nearer this is to 1, the better the model.
2. Root mean squared error (RMSE), which measures how far, on average, the model predictions are from the actual data. The nearer this is to 0, the more accurate the model.
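For reference, with observed rates $y_i$, model predictions $\hat{y}_i$ and $n$ areas, the RMSE is
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$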
For the simple regression model these are `r rsq$r.squared` and `r lm_rmse$rmse` respectively. The RMSE suggests that, on average, the model will predict suicide rates to within 1.36 per 100,000 - a relative error of 13%.
From this model, the variables most strongly associated with suicide rates (p < 0.05) are `r lm_tidy$term`.
Finally we can plot our predictions against the actual values.
```{r, message=FALSE, warning=FALSE}
lm_analysis %>%
ggplot(aes(suicide_rate, .fitted)) +
geom_point() +
geom_smooth(method = "lm") +
theme_gov()
```
## Machine learning
We can apply machine learning algorithms to see if we can improve the predictive value of our models and avoid over-fitting: the problem where a model developed on one dataset appears to perform well but does badly when applied to a test or validation dataset.
Standard ML practice is to split the data into training and test sets using random sampling. Repeating this resampling many times over is known as cross-validation; it helps optimise the model and allows us to choose the most accurate one for further prediction. Of the algorithms included in this analysis, glmnet and random forest have the lowest RMSE and highest R squared - which, admittedly, is quite low at 0.5.
```{r, fig.cap="Summary statistics for model accuracy", message=FALSE, warning=FALSE, cache=TRUE}
library(caretEnsemble)
seed <- 1234
index <- createDataPartition(analysis_final$suicide_rate, p = 0.7, list = FALSE)
train.mod <- analysis_final[index, ]
test.mod <- analysis_final[-index, ]
broom::glance(t.test(train.mod$suicide_rate, test.mod$suicide_rate)) ## no sig diff between test and train
control <- trainControl(method="repeatedcv", number=10, repeats=10, savePredictions=TRUE, classProbs=TRUE)
## Set model list
algorithmList <- c( 'glm', 'glmnet', 'lasso', 'rf', 'svmRadial')
set.seed(seed)
models <- caretList(fmla, data=train.mod, trControl=control, methodList=algorithmList)
results <- resamples(models)
## results
## plot results
dotplot(results)
```
### GLMNET
```{r, glmnet, fig.cap= "Variable importance based on best glmnet model"}
## repeatedcv (rather than cv) so the repeats argument takes effect
enet.mod <- train(fmla, data=train.mod, method="glmnet", tuneLength=5, trControl=trainControl(method="repeatedcv", number=10, repeats = 3, savePredictions = TRUE))
```
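The selected tuning parameters can be inspected directly from the fitted `train` object (a quick interactive check rather than part of the report output):
```{r, eval=FALSE}
enet.mod$bestTune   # the chosen alpha (mixing) and lambda (penalty) values
plot(enet.mod)      # RMSE across the tuning grid
```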
```{r, fig.cap= "Predictions vs suicide rates in test data"}
predictions <- predict(enet.mod, newdata = test.mod)
rmse <- RMSE(predictions, test.mod$suicide_rate)
r2 <- R2(predictions, test.mod$suicide_rate)
test.mod$predictions <- predictions
test.mod %>%
ggplot(aes(predictions, suicide_rate)) +
geom_point() +
geom_smooth(method = "lm")
```
The most accurate model appears to be the glmnet algorithm, with an RMSE of `r rmse[1]` and an R squared of `r r2[1]`. From this we can extract a metric known as variable importance, which tells us which variables are the most predictive (based on their appearance as predictors in each cross-validation run). This shows that the most important predictors (of the variables in the prevention profile) are:
* prevalence of severe mental illness
* prevalence of alcohol treatment in people over 18
* homelessness
* living alone (older people and adults)
* marital break up
* children in the youth justice system
* unemployment rates
* depression
This is illustrated in the figure below:
```{r, fig.cap="Variable importance"}
modvimp <- varImp(enet.mod, scale = FALSE)
plot(modvimp)
```
## Discussion
When people think of machine learning they often think of 'big data'. Although our data is not big in terms of size, it is often complex and multivariate, and machine learning tools can equally well be applied to such datasets. In this example we have applied a machine learning pipeline to the suicide prevention profile. Although none of the models is very highly predictive, glmnet produces the most accurate solution, and we can still build a model which predicts suicide rates to an accuracy of 1.5 per 100,000 (~12% error). This may be useful, but only about half of the variation is explained by the variables in the prevention profile, and the model may not be thought sufficiently predictive to develop further.
Perhaps of more interest is the variable importance, which seems to have face validity in terms of identified predictors of population suicide rates - some of which are amenable to intervention.
It may well be that there are additional data items in the Fingertips database which could improve the predictive value of the model - this will require further work. It would be relatively straightforward to add area depression scores, for example. To facilitate future iterations of this work we have created a data extraction and cleaning script (note that the variable names are not automatically generated).
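As a hypothetical sketch of such an extension, additional indicators can be looked up by name and pulled through the same API (the name match below is an assumption, and the returned data would need the same cleaning steps as above):
```{r, eval=FALSE}
## find candidate depression indicators and download the first match
inds <- indicators()
dep_ids <- inds$IndicatorID[grepl("depression", inds$IndicatorName, ignore.case = TRUE)]
depression <- fingertips_data(IndicatorID = dep_ids[1])
```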
## Conclusion