k-means cluster analysis of speech and language data
Here we want to carry out a k-means cluster analysis to identify different profiles of participants who have similar speech, language, and cognitive abilities. The data collected consist of two tests of phonological skills, working memory, and perception of familiar and unfamiliar speech sounds.
First, we will import packages and functions we need:
import subprocess
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as shc
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
Next, we need to run the R script that will pre-process the speech data. The script fits linear or non-linear hierarchical models for each of the data sets and extracts the participant-level effects that will serve as our estimates for each participant. The script will write a csv file to our current directory.
call_Rscript = subprocess.call('Rscript extract_estimates.R', shell=True)
call_Rscript
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Attaching package: ‘nlme’
The following object is masked from ‘package:dplyr’:
collapse
Attaching package: ‘data.table’
The following objects are masked from ‘package:dplyr’:
between, first, last
The following object is masked from ‘package:purrr’:
transpose
Loading required package: Matrix
Attaching package: ‘Matrix’
The following objects are masked from ‘package:tidyr’:
expand, pack, unpack
Attaching package: ‘lme4’
The following object is masked from ‘package:nlme’:
lmList
0
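The 0 returned by subprocess.call tells us the script exited without an error. As a small defensive sketch (assuming, as described above, that the script writes speech_data.csv to the current directory), we could confirm the exit status and the presence of the file before trying to read it:
import os
# Stop early if the R script failed or did not produce the expected file
if call_Rscript != 0:
    raise RuntimeError('extract_estimates.R exited with a non-zero status')
if not os.path.exists('speech_data.csv'):
    raise FileNotFoundError('speech_data.csv was not found in the current directory')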
Now we will read in the file that we just wrote as a data frame and look at the first few rows:
speech_data = pd.read_csv('speech_data.csv', sep=',', header=0)
speech_data.head()
|  | Unnamed: 0 | Subject | slope_ba_da | consistency_ba_da | slope_s_sh | consistency_s_sh | dprime |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 101 | 0.792550 | -1.124428 | 0.917763 | -0.702372 | 0.611972 |
| 1 | 2 | 102 | 0.464940 | -1.240745 | 0.450287 | -1.211086 | 1.067259 |
| 2 | 3 | 103 | 1.215431 | -1.179974 | 0.692098 | -0.799731 | 0.691741 |
| 3 | 4 | 104 | 0.506354 | -0.978700 | 0.867693 | -1.885947 | 0.057793 |
| 4 | 5 | 106 | 0.374636 | -0.451488 | 0.646777 | -1.194057 | 1.189263 |
It looks like it has an extra column called "Unnamed: 0" that we don't need, so we'll get rid of that:
speech_data = speech_data.drop(columns=['Unnamed: 0'])
speech_data.head()
|  | Subject | slope_ba_da | consistency_ba_da | slope_s_sh | consistency_s_sh | dprime |
|---|---|---|---|---|---|---|
| 0 | 101 | 0.792550 | -1.124428 | 0.917763 | -0.702372 | 0.611972 |
| 1 | 102 | 0.464940 | -1.240745 | 0.450287 | -1.211086 | 1.067259 |
| 2 | 103 | 1.215431 | -1.179974 | 0.692098 | -0.799731 | 0.691741 |
| 3 | 104 | 0.506354 | -0.978700 | 0.867693 | -1.885947 | 0.057793 |
| 4 | 106 | 0.374636 | -0.451488 | 0.646777 | -1.194057 | 1.189263 |
This looks better. Now we'll read in the rest of the data from a different file and rename the column "subject_nr" to "Subject" to match the other data frame. This column name needs to match so we can merge the two data frames on it.
tests = pd.read_csv('phon_wm_scores.csv',sep=',', header=0)
tests = tests.rename({'subject_nr': 'Subject'}, axis='columns')
tests.head()
|  | Subject | nonword_repetition | sound_blending | num_reversed | aud_wm |
|---|---|---|---|---|---|
| 0 | 101 | 8 | 29 | 12 | 29 |
| 1 | 102 | 7 | 25 | 15 | 35 |
| 2 | 103 | 7 | 32 | 11 | 25 |
| 3 | 104 | 7 | 29 | 18 | 35 |
| 4 | 106 | 7 | 25 | 13 | 29 |
Now we merge the data frames and set the Subject column as the index column:
all_measures = pd.merge(tests, speech_data, on='Subject')
all_measures = all_measures.set_index('Subject')
all_measures.head()
| Subject | nonword_repetition | sound_blending | num_reversed | aud_wm | slope_ba_da | consistency_ba_da | slope_s_sh | consistency_s_sh | dprime |
|---|---|---|---|---|---|---|---|---|---|
| 101 | 8 | 29 | 12 | 29 | 0.792550 | -1.124428 | 0.917763 | -0.702372 | 0.611972 |
| 102 | 7 | 25 | 15 | 35 | 0.464940 | -1.240745 | 0.450287 | -1.211086 | 1.067259 |
| 103 | 7 | 32 | 11 | 25 | 1.215431 | -1.179974 | 0.692098 | -0.799731 | 0.691741 |
| 104 | 7 | 29 | 18 | 35 | 0.506354 | -0.978700 | 0.867693 | -1.885947 | 0.057793 |
| 106 | 7 | 25 | 13 | 29 | 0.374636 | -0.451488 | 0.646777 | -1.194057 | 1.189263 |
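Because pd.merge defaults to an inner join, any participant who appears in only one of the two files is dropped silently. A quick check (just a sketch) is to compare the row counts before and after the merge:
# If all_measures has fewer rows than either input data frame,
# some participants were present in only one file and were dropped
print(len(tests), len(speech_data), len(all_measures))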
Cluster analyses work better with scaled data, so that measures on very different scales don't dominate the distance calculations. We will standardize the scores with the formula y = (y - mean) / sd, using the StandardScaler function.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(all_measures)
all_measures_scaled = pd.DataFrame(scaled_features, index=all_measures.index, columns=all_measures.columns)
all_measures_scaled.head()
| Subject | nonword_repetition | sound_blending | num_reversed | aud_wm | slope_ba_da | consistency_ba_da | slope_s_sh | consistency_s_sh | dprime |
|---|---|---|---|---|---|---|---|---|---|
| 101 | -0.501848 | 0.448155 | -1.269763 | -0.959053 | 0.642975 | -0.053508 | 1.638610 | 1.021420 | -0.198564 |
| 102 | -0.898832 | -0.682902 | -0.449063 | 0.402460 | -0.405124 | -0.369450 | -1.824150 | -0.297725 | 0.457181 |
| 103 | -0.898832 | 1.296447 | -1.543330 | -1.866728 | 1.995866 | -0.204383 | -0.032967 | 0.768960 | -0.083674 |
| 104 | -0.898832 | 0.448155 | 0.371638 | 0.402460 | -0.272631 | 0.342319 | 1.267726 | -2.047707 | -0.996743 |
| 106 | -0.898832 | -0.682902 | -0.996196 | -0.959053 | -0.694028 | 1.774340 | -0.368678 | -0.253567 | 0.632902 |
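As a quick sanity check (a sketch, not part of the analysis itself), each scaled column should now have a mean of about 0 and a standard deviation of about 1; note that StandardScaler divides by the population standard deviation (ddof=0):
# Column means should be ~0 and standard deviations ~1 after scaling
print(all_measures_scaled.mean().round(3))
print(all_measures_scaled.std(ddof=0).round(3))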
To determine the number of clusters we want, we'll use the elbow method. To do this, we will first run a principal components analysis and plot the variance explained by each component in a scree plot. The point where the line bends and levels off (i.e., where additional components explain little extra variance) tells us how many components to select.
#define PCA model to use
pca = PCA(n_components=9)
#fit PCA model to data
pca_fit = pca.fit(all_measures_scaled)
PC_values = np.arange(pca.n_components_) + 1
plt.style.use("fivethirtyeight")
plt.figure(figsize=(5, 4))
plt.plot(PC_values, pca.explained_variance_ratio_, 'o-', linewidth=2, color='blue')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.show()
From the scree plot, it looks like we don't explain much more of the variance with more than four components, though we also see a bend at two. Since our data set is pretty small, we'll go with two clusters so we don't end up with too few participants in a cluster.
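A complementary way to pick the number of clusters (shown here only as a sketch) is the classic k-means elbow: fit the model for a range of values of k and plot the within-cluster sum of squares (the model's inertia_), again looking for the point where the curve levels off. The random_state below is an arbitrary choice added for reproducibility.
# Fit k-means for k = 1..8 and record the within-cluster sum of squares
inertias = []
k_values = range(1, 9)
for k in k_values:
    km = KMeans(init='random', n_clusters=k, n_init=10, random_state=42)
    km.fit(all_measures_scaled)
    inertias.append(km.inertia_)
plt.figure(figsize=(5, 4))
plt.plot(k_values, inertias, 'o-', linewidth=2, color='blue')
plt.title('K-means elbow plot')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster sum of squares')
plt.show()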
# Create a k-means clustering model
kmeans = KMeans(init='random', n_clusters=2, n_init=10)
# Fit the data to the model
kmeans.fit(all_measures_scaled)
# Determine which cluster each data point belongs to:
clusters = kmeans.predict(all_measures_scaled)
# Add cluster number to the original data
all_measures_scaled_clustered = pd.DataFrame(all_measures_scaled, columns=all_measures_scaled.columns, index=all_measures_scaled.index)
all_measures_scaled_clustered['cluster'] = clusters
all_measures_scaled_clustered.head()
| Subject | nonword_repetition | sound_blending | num_reversed | aud_wm | slope_ba_da | consistency_ba_da | slope_s_sh | consistency_s_sh | dprime | cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| 101 | -0.501848 | 0.448155 | -1.269763 | -0.959053 | 0.642975 | -0.053508 | 1.638610 | 1.021420 | -0.198564 | 1 |
| 102 | -0.898832 | -0.682902 | -0.449063 | 0.402460 | -0.405124 | -0.369450 | -1.824150 | -0.297725 | 0.457181 | 1 |
| 103 | -0.898832 | 1.296447 | -1.543330 | -1.866728 | 1.995866 | -0.204383 | -0.032967 | 0.768960 | -0.083674 | 1 |
| 104 | -0.898832 | 0.448155 | 0.371638 | 0.402460 | -0.272631 | 0.342319 | 1.267726 | -2.047707 | -0.996743 | 1 |
| 106 | -0.898832 | -0.682902 | -0.996196 | -0.959053 | -0.694028 | 1.774340 | -0.368678 | -0.253567 | 0.632902 | 1 |
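One quick way to check that a two-cluster solution is reasonable (again, just a sketch) is the average silhouette score from sklearn.metrics, which ranges from -1 to 1, with higher values indicating better-separated clusters:
from sklearn.metrics import silhouette_score
# Average silhouette width across all participants for the k = 2 solution
sil = silhouette_score(all_measures_scaled, clusters)
print(f'Mean silhouette score for k = 2: {sil:.3f}')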
# Create a PCA model to reduce our data to 2 dimensions for visualisation
pca = PCA(n_components=2)
pca.fit(all_measures_scaled)
# Transform the scaled data into the new PCA space
measures_reduced = pca.transform(all_measures_scaled)
# Convert to a data frame
measures_reduceddf = pd.DataFrame(measures_reduced, index=all_measures_scaled.index, columns=['PC1','PC2'])
measures_reduceddf['cluster'] = clusters
measures_reduceddf.head()
| Subject | PC1 | PC2 | cluster |
|---|---|---|---|
| 101 | -1.116364 | -0.440510 | 1 |
| 102 | -0.405248 | -0.648170 | 1 |
| 103 | -1.681884 | -1.272908 | 1 |
| 104 | -0.932461 | 1.146899 | 1 |
| 106 | -1.133984 | -2.090004 | 1 |
plt.style.use("fivethirtyeight")
plt.figure(figsize=(5, 4))
scat = sns.scatterplot(
data=measures_reduceddf,
x="PC1",
y="PC2",
hue="cluster",
style="cluster",
palette="Set2",
)
Here we can see the clusters of participants who show similar patterns on the cognitive, speech, and language measures we gathered.
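To get a sense of what distinguishes the two profiles, one option (a sketch that goes beyond the plot above) is to compare the mean scaled score on each measure by cluster; positive values are above the sample mean and negative values below it:
# Mean of each standardized measure within each cluster
cluster_profiles = all_measures_scaled_clustered.groupby('cluster').mean()
print(cluster_profiles.round(2))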