An application of kNN to diagnose Diabetes
2025-04-23
k-Nearest Neighbors (kNN) is an algorithm used in a variety of fields to classify or predict data (Ali et al. 2020).
It is a simple algorithm that classifies a datapoint based on how similar it is to the datapoints of each class (Zhang 2016).
One benefit of this model is that it is simple to use and non-parametric: it makes no assumption about the underlying data distribution, so it fits a wide variety of datasets.
One drawback is its higher computational cost compared with other models, which means it does not perform as well or as fast on big data (Deng et al. 2016).
In this project we focused on the methodology and application of classification kNN models in the field of healthcare or public health to predict or screen for diabetes.
The classification process has three distinct steps:
1. Distance Calculation
\[ d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \] (Kataria and Singh 2013)
2. Neighbor Selection
K=5 vs K=15
3. Classification decision based on majority voting
Majority wins
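The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the project's implementation: the function names and the toy data are made up, and the distance is the two-feature Euclidean formula generalized to any number of features.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    # Step 1: distance from the query to every training point.
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 2: keep the k closest points.
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two features, binary label.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.2, 1.4), k=3))  # 0
print(knn_predict(train_X, train_y, (6.4, 6.2), k=3))  # 1
```

Every prediction scans the whole training set, which is where the computational cost on large datasets comes from.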
The kNN algorithm assumes similar datapoints will be in close proximity to each other and be neighbors (Zhang 2016).
It also assumes that datapoints with similar features belong to the same class (Boateng, Otoo, and Abaye 2020).
In order to increase the accuracy of the model there are a few parameters that we can adjust.
1. Find the optimal k parameter: We can use a grid search to find the best value of k.
2. Change the distance metric: kNN uses the Euclidean distance by default, but we can use the Manhattan distance, the Minkowski distance, or another distance.
3. Weights: kNN defaults to a “uniform” weight, where every neighbor gets the same weight, but it can be set to “distance” so that the closest neighbors have more weight.
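A rough sketch of how these three settings interact, written in plain Python on made-up data. The function names are hypothetical. The search scores each combination with leave-one-out accuracy, used here as a simple stand-in for the cross-validation a library grid search would run.

```python
from collections import defaultdict

def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train_X, train_y, query, k, p=2, weights="uniform"):
    nearest = sorted(
        ((minkowski(x, query, p), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    scores = defaultdict(float)
    for dist, label in nearest:
        # "uniform": one vote each; "distance": vote weighted by 1/distance.
        scores[label] += 1.0 if weights == "uniform" else 1.0 / (dist + 1e-12)
    return max(scores, key=scores.get)

def leave_one_out_accuracy(X, y, **params):
    """Predict each point from all the others and report the hit rate."""
    hits = 0
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], **params) == y[i]
    return hits / len(X)

# Weighting can flip a vote: one very close neighbor vs. two distant ones.
line_X, line_y = [(0.0,), (10.0,), (11.0,)], [1, 0, 0]
print(knn_predict(line_X, line_y, (1.0,), k=3, weights="uniform"))   # 0
print(knn_predict(line_X, line_y, (1.0,), k=3, weights="distance"))  # 1

# Toy grid search over k, the distance metric (p), and the weighting scheme.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5),
     (6.0, 6.0), (6.5, 7.0), (7.0, 6.0), (5.5, 5.0)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
grid = [
    {"k": k, "p": p, "weights": w}
    for k in (1, 3, 5)
    for p in (1, 2)
    for w in ("uniform", "distance")
]
best = max(grid, key=lambda params: leave_one_out_accuracy(X, y, **params))
print(best, leave_one_out_accuracy(X, y, **best))
```

On data this cleanly separated every combination scores the same, so the search only becomes informative on realistic, noisy data.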
We explored the CDC Diabetes Health Indicators dataset, sourced from the UC Irvine Machine Learning Repository.
The dataset consists of 253,680 survey responses and contains 21 feature variables and 1 binary target variable named Diabetes_binary.
Diabetes_binary: 0 = No Diabetes, 1 = Diabetes
Binary Variables: HighBP, HighChol, CholCheck, Smoker, Stroke, HeartDiseaseorAttack, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, DiffWalk, Sex.
Ordinal Variables: GenHlth, MentHlth, PhysHlth, Age, Education, Income
Continuous Variables: BMI
Figure 5 shows outliers in the data that can skew our results.
Figure 6 shows the class imbalance present in the data.
There are no missing values, meaning no imputation is needed.
We have some duplicate values that need to be removed.
There is a class imbalance with the majority of cases not having diabetes.
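These three checks can be illustrated on a handful of made-up rows. This is only a sketch in plain Python; the rows, the column choice, and the variable names are assumptions for illustration, not the project's preprocessing code.

```python
from collections import Counter

# Toy rows (made up): (HighBP, BMI, Diabetes_binary); the last field is the target.
rows = [
    (1, 31.0, 0),
    (0, 24.0, 0),
    (1, 31.0, 0),   # exact duplicate of the first row
    (1, 35.0, 1),
    (0, 22.0, 0),
    (0, 24.0, 0),   # exact duplicate of the second row
]

# 1. Missing values: count fields that are None.
missing = sum(value is None for row in rows for value in row)

# 2. Duplicates: dict.fromkeys keeps the first occurrence of each row, in order.
deduped = list(dict.fromkeys(rows))

# 3. Class balance: share of each target value after deduplication.
counts = Counter(row[-1] for row in deduped)
shares = {label: n / len(deduped) for label, n in counts.items()}

print(missing, len(rows) - len(deduped), shares)
```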
We chose to create four classification kNN models to illustrate the methodology.
Table 4: Model Summary

| Model Name | k value | Weights | Distance | SMOTE | Decision Threshold |
|---|---|---|---|---|---|
| Model 1 | 5 | 'uniform' | Euclidean | No | 0.5 |
| Model 2 | 15 | 'uniform' | Euclidean | No | 0.5 |
| Model 3 | 15 | 'distance' | Euclidean | Yes | 0.5 |
| Model 4 | 15 | 'uniform' | Euclidean | Yes | 0.2 |
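
Table 4 varies two settings beyond k and the weights: SMOTE and the decision threshold. The sketch below illustrates both ideas. It assumes the threshold is applied to the share of diabetic neighbors and that SMOTE is used in its basic form, where a synthetic minority point is interpolated between a minority sample and one of its minority-class neighbors. The helper names and values are hypothetical.

```python
import random

def predict_with_threshold(neighbor_labels, threshold=0.5):
    """Flag a case as diabetic when the share of diabetic neighbors reaches the threshold."""
    positive_share = sum(neighbor_labels) / len(neighbor_labels)
    return int(positive_share >= threshold)

def min_positive_neighbors(k, threshold):
    """Smallest number of positive neighbors (out of k) that triggers a positive call."""
    return next(n for n in range(k + 1) if n / k >= threshold)

def smote_like_point(sample, neighbor, rng):
    """Synthetic point on the segment between a minority sample and a minority neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return tuple(s + gap * (n - s) for s, n in zip(sample, neighbor))

# 4 diabetic neighbors out of 15 (made-up labels): share is about 0.27.
labels = [1] * 4 + [0] * 11
print(predict_with_threshold(labels, 0.5))  # 0
print(predict_with_threshold(labels, 0.2))  # 1
print(min_positive_neighbors(15, 0.5), min_positive_neighbors(15, 0.2))  # 8 3

# One synthetic minority point between two made-up minority samples.
rng = random.Random(0)
print(smote_like_point((30.0, 1.0), (34.0, 1.0), rng))
```

Under these assumptions, with k = 15 a threshold of 0.5 needs 8 diabetic neighbors for a positive call, while a threshold of 0.2 needs only 3, which is why lowering the threshold trades precision for recall.
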
The table below summarizes the performance of the four models.
| Model | k | Weight | SMOTE | Decision Threshold | Accuracy | F1 Score | Precision | Recall | ROC AUC |
|---|---|---|---|---|---|---|---|---|---|
| Model 1 | 5 | Uniform | No | 0.5 | 83.22% | 27.77% | 40.66% | 21.09% | 0.71 |
| Model 2 | 15 | Uniform | No | 0.5 | 84.56% | 22.38% | 48.37% | 14.56% | 0.77 |
| Model 3 | 15 | Distance | Yes | 0.5 | 67.77% | 39.77% | 27.84% | 69.58% | 0.74 |
| Model 4 | 15 | Uniform | Yes | 0.2 | 47.52% | 34.81% | 21.49% | 91.62% | 0.75 |
Model 2 has the highest accuracy at 84.56%, but this score is high mainly because the model is good at detecting non-diabetic cases, which are the majority of cases.
Model 2 also has the highest ROC AUC score of 0.77, which means it is the best model at separating the two classes; however, its recall is 14.56%, so it correctly identifies only 14.56% of the actual diabetic cases.
Model 3 has an accuracy of 67.77% and a much higher recall of 69.58%, so it correctly classifies about 70% of the diabetic cases.
Model 4 has the best recall, identifying 91.62% of the diabetic cases, but its overall accuracy is 47.52% and its precision is 21.49%, so it classifies many non-diabetic cases as diabetic.
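The metrics discussed here are all derived from the confusion matrix, and the F1 column of the results table can be cross-checked from the reported precision and recall. The sketch below does both. The confusion counts are made up to show how high accuracy can coexist with low recall on imbalanced data; only the precision, recall, and F1 values come from the table.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Made-up counts: 100 diabetic and 900 non-diabetic cases.
toy = confusion_metrics(tp=15, fp=15, fn=85, tn=885)
print(toy)  # accuracy 0.9, yet only 15% of diabetic cases are found

# (precision, recall, reported F1) from the results table, as fractions.
reported = {
    "Model 1": (0.4066, 0.2109, 0.2777),
    "Model 2": (0.4837, 0.1456, 0.2238),
    "Model 3": (0.2784, 0.6958, 0.3977),
    "Model 4": (0.2149, 0.9162, 0.3481),
}
for name, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{name}: F1 = {f1:.2%} (reported {f1_reported:.2%})")
```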
kNN is a promising algorithmic model that can be further improved to classify or screen for diabetes.
Model 3 showed potential for classifying diabetic cases, but before it is used in a healthcare setting it would need to be improved by training on data that contains more diabetic cases.
Model 4 showed high potential for diabetes screening and would be useful in a public health setting, where people classified as high risk based on their self-reported health indicators are referred for lifestyle education and biochemical testing.