Diagnosing Diseases using kNN

An application of kNN to diagnose Diabetes

Jacqueline Razo (Advisor: Dr. Cohen)

2025-04-23

Introduction

  • k-Nearest Neighbors (kNN) is an algorithm that can be used in a variety of fields to classify or predict data (Ali et al. 2020).

  • It is a simple algorithm that classifies a data point based on how similar it is to existing classes of data points (Zhang 2016).

  • One benefit of this model is its simplicity; it is also non-parametric, which means it fits a wide variety of datasets.

  • One drawback is its higher computational cost compared to other models, which means it does not perform as well or as fast on big data (Deng et al. 2016).

  • In this project we focus on the methodology and application of kNN classification models in healthcare and public health to predict or screen for diabetes.

Methodology Overview

  • The kNN algorithm is a nonparametric supervised learning algorithm that can be used for classification or regression problems (Syriopoulos et al. 2023).

The classification process has three distinct steps (a minimal sketch follows the list):

1. Distance Calculation

\[ d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \] (Kataria and Singh 2013)

2. Neighbor Selection

  The choice of k (e.g., K = 5 vs. K = 15) determines how many neighbors vote.

3. Classification decision based on majority voting

  The majority class among the k neighbors wins.
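
A minimal from-scratch sketch of the three steps in Python, using toy 2-D points in place of real data; this is only for illustration, not the library-based models built later in the project.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one point with plain majority-vote kNN."""
    # Step 1: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: pick the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example with two classes ("heart" vs. "circle")
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9]])
y = np.array(["heart", "heart", "heart", "circle", "circle"])
print(knn_predict(X, y, np.array([1.1, 1.0]), k=3))  # -> heart
```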

Methodology Visualization

  • Figure 1 illustrates this methodology with two distinct classes of hearts and circles.

Figure 1

Assumptions

  • The kNN algorithm assumes that similar data points lie in close proximity to each other, i.e., that they are neighbors (Zhang 2016).

  • It also assumes that data points with similar features belong to the same class (Boateng, Otoo, and Abaye 2020).

Pre-processing Data

  • Handle missing values: Missing values must be handled by either imputing them or dropping them to prevent them from skewing the results.
  • Make all values numeric: All categorical values must be encoded using either one-hot encoding or label encoding.
  • Normalize or standardize the features: A min-max scaler or standard scaler keeps large-scale features from dominating the distance calculation and introducing bias.
  • Reduce dimensionality: We can use Principal Component Analysis to reduce the number of features while retaining most of the variance.
  • Remove correlated features: kNN works best when there are not too many features, so we can use a correlation matrix to see which features to drop.
  • Fix class imbalance: The Synthetic Minority Over-sampling Technique (SMOTE) can be used to handle class imbalances that would otherwise bias the model (a pre-processing sketch follows this list).
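
A minimal pre-processing sketch covering these steps with pandas, scikit-learn, and imbalanced-learn; the small synthetic DataFrame and its column names are placeholders, not the actual CDC data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE

# Tiny synthetic stand-in for the real data (placeholder columns, not the CDC dataset)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "bmi": rng.normal(28, 6, 200),
    "age_group": rng.integers(1, 14, 200),
    "smoker": rng.choice(["yes", "no"], 200),
    "target": rng.choice([0, 1], 200, p=[0.85, 0.15]),
})

df = df.drop_duplicates().dropna()                            # handle duplicates / missing values
df = pd.get_dummies(df, columns=["smoker"], drop_first=True)  # encode the categorical column

X, y = df.drop(columns="target"), df["target"]
X_scaled = StandardScaler().fit_transform(X)                  # put all features on one scale
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)    # keep ~95% of the variance

X_res, y_res = SMOTE(random_state=42).fit_resample(X_reduced, y)  # oversample the minority class
```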

Hyperparameter Tuning

To increase the accuracy of the model, there are a few hyperparameters we can adjust.

1. Find the optimal k parameter: We can use grid search (see the sketch after this list) to find the best value of k.

2. Change the distance metric: kNN uses the Euclidean distance by default, but the Manhattan distance, the Minkowski distance, or another metric can be used instead.

3. Weights: kNN defaults to “uniform” weights, which give all neighbors equal weight, but this can be changed to “distance” so that closer neighbors carry more weight.
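
A minimal tuning sketch using scikit-learn's GridSearchCV over these three hyperparameters; the synthetic data and the candidate values in the grid are illustrative assumptions, not the exact grid used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (placeholder for the scaled diabetes features)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "n_neighbors": [5, 15, 25],                          # candidate k values
    "weights": ["uniform", "distance"],                  # equal vs. distance-based voting weight
    "metric": ["euclidean", "manhattan", "minkowski"],   # distance metrics
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```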

Dataset Overview

  • We explored the CDC Diabetes Health Indicators dataset, sourced from the UC Irvine Machine Learning Repository (a loading sketch follows this list).

  • The dataset consists of 253,680 survey responses and contains 21 feature variables and one binary target variable named Diabetes_binary.

  • Diabetes_binary: 0 = no diabetes, 1 = diabetes

  • Binary Variables: HighBP, HighChol, CholCheck, Smoker, Stroke, HeartDiseaseorAttack, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, DiffWalk, Sex.

  • Ordinal Variables: GenHlth, MentHlth, PhysHlth, Age, Education, Income

  • Continuous Variables: BMI
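
A sketch of loading the data and separating features from the target; the CSV filename is hypothetical and assumes a local copy downloaded from the UCI repository.

```python
import pandas as pd

# Hypothetical local copy of the CDC Diabetes Health Indicators data
df = pd.read_csv("cdc_diabetes_health_indicators.csv")

X = df.drop(columns="Diabetes_binary")    # 21 feature variables
y = df["Diabetes_binary"]                 # 0 = no diabetes, 1 = diabetes

print(df.shape)                           # expected: (253680, 22)
print(y.value_counts(normalize=True))     # check the class balance
```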

Data Exploration and Visualization - Outliers

Figure 5 shows outliers in the data that can skew our results.

Data Exploration and Visualization - Class imbalance

Figure 6 shows the class imbalance present in the data.

Data Exploration and Visualization - Key Findings

  • There are no missing values, meaning no imputation is needed.

  • We have some duplicate values that need to be removed.

  • There is a class imbalance, with the majority of cases not having diabetes; quick checks for these findings are sketched below.
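
These findings can be confirmed with a few quick pandas checks, continuing the hypothetical `df` from the loading sketch above.

```python
# df is the DataFrame from the loading sketch above (hypothetical local copy)
print(df.isna().sum().sum())                                # 0 -> no imputation needed
print(df.duplicated().sum())                                # duplicate rows to drop
print(df["Diabetes_binary"].value_counts(normalize=True))   # class imbalance
print(df["BMI"].describe())                                 # large max values suggest outliers
```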

Building the Models

  • There was no missing data, so we did not have to remove or impute any values, but we did clean the data by dropping duplicate rows.
  • We kept the ordinal variables as-is because their natural order provides the kNN with meaningful distances.
  • The data was split into training and testing sets with test_size=0.2, i.e., 80% of the data for training the kNN and 20% for testing.
  • The features were standardized so that BMI and Age could be on the same scale as the other features.
  • Grid search was used to find the optimal hyperparameters.
  • We experimented with different decision thresholds due to the imbalanced dataset (a pipeline sketch follows this list).
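
A condensed sketch of this pipeline, continuing the hypothetical `X` and `y` from the loading sketch; the random seeds and the SMOTE/threshold settings shown are illustrative choices (roughly Model 4), not the project's exact code.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize so BMI and Age are on the same scale as the binary features
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Oversample the minority (diabetic) class in the training data only
X_train_s, y_train = SMOTE(random_state=42).fit_resample(X_train_s, y_train)

# Fit kNN with k=15 and uniform weights, then lower the decision threshold to 0.2
knn = KNeighborsClassifier(n_neighbors=15, weights="uniform").fit(X_train_s, y_train)
proba = knn.predict_proba(X_test_s)[:, 1]
y_pred = (proba >= 0.2).astype(int)
```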

Modeling and Results - Model Creation

We chose to create four classification kNN models to illustrate the methodology.

Table 4: Model Summary
  Model     k    Weights      Distance    SMOTE   Decision Threshold
  Model 1    5   'uniform'    Euclidean   No      0.5
  Model 2   15   'uniform'    Euclidean   No      0.5
  Model 3   15   'distance'   Euclidean   Yes     0.5
  Model 4   15   'uniform'    Euclidean   Yes     0.2

Modeling and Results- Evaluating the models

The table below shows the summary of the four models; the metric computation is sketched after the table.

Table 1: kNN Model Performance Summary
  Model     k    Weight     SMOTE   Decision Threshold   Accuracy   F1 Score   Precision   Recall   ROC AUC
  Model 1    5   uniform    No      0.5                  83.22%     27.77%     40.66%      21.09%   0.71
  Model 2   15   uniform    No      0.5                  84.56%     22.38%     48.37%      14.56%   0.77
  Model 3   15   distance   Yes     0.5                  67.77%     39.77%     27.84%      69.58%   0.74
  Model 4   15   uniform    Yes     0.2                  47.52%     34.81%     21.49%      91.62%   0.75
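
For reference, the metrics in the table can be computed with scikit-learn; this minimal sketch continues the `y_test`, `y_pred`, and `proba` variables from the model-building sketch above.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# y_test, y_pred, proba come from the model-building sketch above
print("Accuracy :", accuracy_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, proba))   # AUC uses probabilities, not labels
```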

Results

  • Model 2 has the highest accuracy at 84.56%, but this score is inflated because the model is good at detecting the non-diabetic cases, which make up the majority of the data.

  • Model 2 also has the highest ROC AUC of 0.77, which means it is the best at separating the two classes; however, its recall is 14.56%, so it correctly identifies only 14.56% of the actual diabetic cases.

  • Model 3 has an accuracy of 67.77% and a much higher recall of 69.58%, correctly classifying about 70% of the positive diabetic cases.

  • Model 4 has the best recall, identifying 91.62% of the positive diabetic cases, but its overall accuracy is 47.52% because it classifies many non-diabetic cases as diabetic.

Conclusion

  • kNN is a promising algorithmic model that can be further improved to classify or screen for diabetes.

  • Model 3 showed potential for classifying diabetic cases but would need further improvement, such as training on data with more diabetic cases, before it could be used in a healthcare setting.

  • Model 4 showed high potential for screening for diabetes and would be useful in a public health setting, where people classified as high risk based on their self-reported health indicators could be referred for lifestyle education and biochemical testing.

References

Ali, Ameer, Mohammed Alrubei, L. F. Mohammed Hassan, M. Al-Ja’afari, and Saif Abdulwahed. 2020. “Diabetes Classification Based on KNN.” IIUM Engineering Journal 21 (1): 175–81.
Boateng, Ernest Yeboah, Joseph Otoo, and Daniel A Abaye. 2020. “Basic Tenets of Classification Algorithms k-Nearest-Neighbor, Support Vector Machine, Random Forest and Neural Network: A Review.” Journal of Data Analysis and Information Processing 8 (4): 341–57.
Deng, Zhenyun, Xiaoshu Zhu, Debo Cheng, Ming Zong, and Shichao Zhang. 2016. “Efficient kNN Classification Algorithm for Big Data.” Neurocomputing 195: 143–48.
Kataria, Aman, and MD Singh. 2013. “A Review of Data Classification Using k-Nearest Neighbour Algorithm.” International Journal of Emerging Technology and Advanced Engineering 3 (6): 354–60.
Syriopoulos, Panos K., Nektarios G. Kalampalikis, Sotiris B. Kotsiantis, and Michael N. Vrahatis. 2023. “kNN Classification: A Review.” Annals of Mathematics and Artificial Intelligence, 1–33.
Zhang, Zhongheng. 2016. “Introduction to Machine Learning: K-Nearest Neighbors.” Annals of Translational Medicine 4 (11).