Diagnosing Diseases using kNN

An application of kNN to diagnose Diabetes

Author

Jacqueline Razo (Advisor: Dr. Cohen)

Published

April 28, 2025


Introduction

The k-Nearest Neighbors (kNN) algorithm is used in a variety of fields to classify or predict data. It is a simple algorithm that classifies a data point based on how similar it is to an existing class of data points. One benefit of this model is its simplicity; another is that it is non-parametric, which means it fits a wide variety of datasets. One drawback is that it has a higher computational cost than other models, so it does not perform as well or as fast on big data. Despite this, the model's simplicity makes it easy to understand and to implement in a variety of fields. One such field is healthcare, where kNN models have been used successfully to predict diseases such as diabetes and hypertension. In this paper we focus on the methodology and application of kNN models in healthcare and public health to predict or screen for diabetes, a pressing public health problem.

Literature Review

The literature review explores the theoretical background of kNN, key factors affecting its performance, recent advancements in optimizing kNN for large datasets, and the role of kNN in medical diagnosis, particularly diabetes prediction.

Theoretical Background of kNN

kNN is a supervised learning algorithm that labels a data point by comparing it to other, similar data points. It rests on the assumption that data points that are similar to each other lie close to each other. In (Z. Zhang 2016), the author introduces how kNN works and how to run a kNN model in RStudio. He describes the methodology as assigning an unlabeled observation to a class by using labeled examples that are similar to it, and he presents the Euclidean distance equation, the default distance measure used in kNN. The author also describes the impact of the k parameter, which tells the model how many neighbors to use when classifying a data point. Zhang recommends setting the k parameter equal to the square root of the number of observations in the training dataset.
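Zhang's rule of thumb is easy to sketch. The helper below is illustrative (not taken from the cited work); it additionally nudges k to an odd number so a two-class majority vote cannot end in a tie:

```python
import math

def rule_of_thumb_k(n_train):
    """Starting k = square root of the training-set size,
    rounded and made odd so a binary vote cannot tie."""
    k = max(1, round(math.sqrt(n_train)))
    return k if k % 2 == 1 else k + 1

print(rule_of_thumb_k(400))  # a 400-observation training set suggests k = 21
```

This is only a starting point; the tuning methods discussed below can refine it.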

Although Zhang’s recommendation is a reasonable starting point, (S. Zhang et al. 2017) proposed decision-tree-assisted tuning to optimize k, significantly enhancing accuracy. The authors propose a training stage in which a decision tree selects the ideal k value, making kNN more efficient. They deployed and tested two more efficient kNN methods, called kTree and k*Tree, and found that their methods reduced running costs and increased classification accuracy.

Another large influence on accuracy is the distance measure the model uses to find neighbors. Although the Euclidean distance is the default in kNN, other distances can be used. In (Kataria and Singh 2013), the authors compare different distances in classification algorithms, with a focus on kNN. The paper starts by explaining how kNN uses the nearest k neighbors to classify data points, then describes how the Euclidean distance does this by drawing a line segment between point a and point b and measuring its length with the Euclidean distance formula. It moves on to the “city block” or taxicab distance, described as “the sum of the length of the projections of the line segment”, and also covers the cosine and correlation distances. Comparing the default Euclidean distance against the city block, cosine, and correlation distances, the authors found the Euclidean distance to be the most efficient in their observations.

Syriopoulos et al. (Syriopoulos et al. 2023) also reviewed distance metric selection, confirming that Euclidean distance remains the most effective choice for most datasets. However, alternative metrics like Mahalanobis distance can perform better for correlated features. The review emphasized that selecting the right metric is dataset-dependent, influencing classification accuracy.
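The distances compared in these studies each take only a few lines to define. The sketch below is a plain-Python illustration for two feature vectors, showing the taxicab and cosine variants alongside the Euclidean default:

```python
import math

def euclidean(a, b):
    """Straight-line distance: the kNN default."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """'City block' / taxicab distance: sum of axis-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

p, q = (10, 12), (14, 16)
print(euclidean(p, q))   # sqrt(32) ≈ 5.657
print(manhattan(p, q))   # 8
```

Note how the two metrics rank the same pair of points differently in magnitude; on a real dataset such differences can change which points count as "nearest".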

Challenges in Scaling kNN for Large Datasets

While kNN is simple and effective, it struggles with computational inefficiency when working with large datasets since it must calculate distances for every new observation. This becomes a major challenge in big data, where the sheer volume of information makes traditional kNN methods slow and resource-intensive.

To address this, Deng et al. (Deng et al. 2016) proposed an improved approach called LC-kNN, which combines k-means clustering with kNN to speed up computations and enhance accuracy. By dividing large datasets into smaller clusters, their method reduces the number of distance calculations needed. After extensive testing, the authors found that LC-kNN consistently outperformed standard kNN, achieving higher accuracy and better efficiency. Their study highlights a key limitation of traditional kNN (without optimization, its performance significantly declines on big data) and offers an effective solution to improve its scalability.

Continuing and summarizing these ideas, Syriopoulos et al. (Syriopoulos et al. 2023) explored techniques for accelerating kNN computations, such as:

  • Dimensionality reduction (e.g., PCA, feature selection) to reduce data complexity.
  • Approximate Nearest Neighbor (ANN) methods to speed up distance calculations.
  • Hybrid models combining kNN with clustering (e.g., LC-kNN) to improve efficiency.

These approaches enhanced both speed and accuracy, making them promising solutions for handling large datasets. In addition, the study categorizes kNN modifications into local hyperplane methods, fuzzy-based models, weighting schemes, and hybrid approaches, demonstrating how these adaptations help tackle issues like class imbalance, computational inefficiency, and sensitivity to noise.

Another key challenge for kNN is its performance on high-dimensional datasets. The 2023 study by Syriopoulos et al. evaluates multiple nearest-neighbor search algorithms, such as kd-trees, ball trees, Locality-Sensitive Hashing (LSH), and graph-based search methods, that scale kNN to larger datasets by minimizing distance calculations.

The enhancements to kNN have substantially increased its performance in terms of speed and accuracy which now allows it to better handle large-scale datasets. However, as Syriopoulos et al. primarily compile prior research rather than conducting empirical comparisons, further work is needed to evaluate these optimizations in real-world medical classification tasks.

kNN in Disease Prediction: Applications & Limitations

Disease Prediction with kNN

kNN has been widely used for diabetes classification and early detection. Ali et al. (Ali et al. 2020) tested six different kNN variants in MATLAB to classify blood glucose levels, finding that fine kNN was the most accurate. Their research highlights how optimizing kNN can improve classification performance, making it a valuable tool in healthcare.

In turn, Saxena et al. (Saxena, Khan, and Singh 2014) used kNN on a diabetes dataset and observed that increasing the number of neighbors (k) led to better accuracy, but only to a certain extent. In their MATLAB-based study, they found that using k = 3 resulted in 70% accuracy, while increasing k to 5 improved it to 75%. Both studies demonstrate how kNN can effectively classify diabetes, with accuracy depending on the choice of k and dataset characteristics. Ongoing research continues to refine kNN, making it a more efficient and reliable tool for medical applications.

kNN needs high accuracy to be used in medical applications, but a recurrent problem in disease prediction is imbalanced datasets that lead to low accuracy. In (Esposito et al. 2021), the authors found that models trained on imbalanced data tended to classify unknown data points as the majority class. This is a problem because most people are healthy, so the diseases we are trying to identify make up the minority class. They noted that the default classification threshold of 0.5 does not work well with real-world imbalanced datasets and recommended adjusting the decision threshold. They discussed doing so with random forests or with their novel method, the Generalized tHreshOld ShifTing procedure (GHOST), which adjusts the classification threshold of machine learning classifiers.
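The threshold-shifting idea is simple to illustrate. The sketch below is not GHOST itself, just the underlying mechanism: lowering the cutoff lets more borderline cases be flagged as the minority (diabetic) class. The probabilities here are hypothetical:

```python
def classify(prob_positive, threshold=0.5):
    """Turn a predicted probability of the positive (minority) class
    into a hard 0/1 label using the given cutoff."""
    return 1 if prob_positive >= threshold else 0

# Hypothetical predicted probabilities for five patients,
# where class 1 (diabetic) is the minority class.
probs = [0.62, 0.48, 0.35, 0.20, 0.55]

default = [classify(p) for p in probs]                  # cutoff 0.5
shifted = [classify(p, threshold=0.3) for p in probs]   # lowered cutoff

print(default)  # [1, 0, 0, 0, 1]
print(shifted)  # [1, 1, 1, 0, 1]
```

Lowering the threshold trades some false positives for fewer missed cases, which is often the right trade-off when screening for a disease.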

Feature selection is another critical factor. Panwar et al. (Panwar et al. 2016) demonstrated that focusing on just BMI and Diabetes Pedigree Function improved accuracy, suggesting that simplifying feature selection enhances model performance. Suriya and Muthu (Suriya and Muthu 2023) showed that kNN is a promising model for predicting type 2 diabetes, achieving its highest accuracy on smaller datasets. The authors tested three datasets ranging from 692 to 1,853 rows and 9 to 22 dimensions and found that larger datasets require higher k values. In addition, reducing dimensionality with PCA did not improve model performance, suggesting that simplifying the data does not always lead to better results in diabetes prediction. Similar findings on the influence of PCA on machine learning models, and on kNN in particular, were reported by Iparraguirre-Villanueva et al. (Iparraguirre-Villanueva et al. 2023). They also confirmed that kNN alone is not always the best choice: comparing kNN with Logistic Regression, Naïve Bayes, and Decision Trees, they found that kNN performed well on balanced datasets but struggled when class imbalances existed. While PCA significantly reduced accuracy for all models, the SMOTE-preprocessed dataset produced the highest accuracy for kNN (79.6%), followed by BNB at 77.2%. This reveals the importance of correct preprocessing techniques in improving kNN accuracy, especially when handling imbalanced datasets.

Khateeb & Usman (Khateeb and Usman 2017) extended kNN’s application to heart disease prediction, demonstrating that feature selection and data balancing techniques significantly impact accuracy. Their study showed that removing irrelevant features did not always improve performance, emphasizing the need for careful feature engineering in medical datasets.

kNN Beyond Prediction: Handling Missing Data

While kNN is widely known for classification, it also plays a key role in data preprocessing for medical machine learning. Altamimi et al. (Altamimi et al. 2024) explored kNN imputation as a method to handle missing values in medical datasets. Their study showed that applying kNN imputation before training a machine learning model significantly improved diabetes prediction accuracy - from 81.13% to 98.59%. This suggests that kNN is not only useful for disease classification but also for improving data quality and completeness in healthcare applications.

Traditional methods often discard incomplete records, but kNN imputation preserves valuable information, leading to more reliable model performance. However, Altamimi et al. (2024) also highlighted challenges such as computational costs and sensitivity to parameter selection, reinforcing the need for further optimization when applying kNN to large-scale medical datasets.
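A toy version of kNN imputation (not Altamimi et al.'s exact procedure) makes the idea concrete: fill a record's missing value with the mean of that value among the k complete records nearest on the remaining features. The feature names here are hypothetical:

```python
import math

def knn_impute(rows, target_idx, missing_row, k=2):
    """Fill the missing feature (index target_idx) of missing_row with
    the mean of that feature among the k complete rows nearest on the
    remaining features. A toy sketch of kNN imputation."""
    other = [i for i in range(len(missing_row)) if i != target_idx]

    def dist(r):
        return math.sqrt(sum((r[i] - missing_row[i]) ** 2 for i in other))

    nearest = sorted(rows, key=dist)[:k]
    return sum(r[target_idx] for r in nearest) / k

# Toy data: (BMI, glucose); the new record is missing its glucose value.
complete = [(25.0, 100.0), (26.0, 104.0), (40.0, 180.0)]
print(knn_impute(complete, target_idx=1, missing_row=(25.5, None), k=2))  # 102.0
```

scikit-learn's KNNImputer implements a production version of this idea for whole datasets.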

Comparing kNN Variants & Hybrid Approaches

Research indicates that kNN works well for diabetes prediction, but recent studies demonstrate it does not consistently provide the best results. Theerthagiri et al. (Theerthagiri, Ruby, and Vidya 2022) evaluated kNN against multiple machine learning models, including Naïve Bayes, Decision Trees, Extra Trees, Radial Basis Function (RBF) networks, and Multi-Layer Perceptron (MLP), on the Pima Indians Diabetes dataset. kNN performed adequately, but MLP excelled beyond all other algorithms, achieving the top accuracy of 80.68% and the leading AUC-ROC of 86%. Despite its effectiveness in classification tasks, kNN’s primary limitation is its inability to compete with advanced models like neural networks on complex datasets.

In turn, Uddin et al.(Uddin et al. 2022) explored advanced kNN variants, including Weighted kNN, Distance-Weighted kNN, and Ensemble kNN. Their findings suggest that:

  • Weighted kNN improved classification by assigning greater importance to closer neighbors.
  • Ensemble kNN outperformed standard kNN in disease prediction but required additional computational resources.
  • Performance was highly sensitive to the choice of distance metric and k value tuning.

Their findings suggest that kNN can be improved through modifications, but it remains highly sensitive to dataset size, feature selection, and distance metric choices. In large-scale healthcare applications, Decision Trees (DT) and ensemble models may offer better trade-offs between accuracy and efficiency. These studies highlight the ongoing debate over kNN’s role in medical classification - whether modifying kNN is the best approach or if other models, such as DT or ensemble learning, provide stronger performance for diagnosing diseases.

kNN continues to be a valuable tool in medical machine learning, offering simplicity and strong performance in classification tasks. However, as research shows, its effectiveness depends on proper feature selection, optimized k values, and preprocessing techniques like imputation. While kNN remains an interpretable and adaptable model, newer methods - such as ensemble learning and neural networks - often outperform it, particularly in large-scale datasets. For our capstone project, exploring feature selection, fine-tuning kNN’s settings, and comparing it to other algorithms could give us valuable insights into its strengths and limitations.

Methods

The kNN algorithm is a nonparametric supervised learning algorithm that can be used for classification or regression problems (Syriopoulos et al. 2023). In classification, it works on the assumption that similar data points are close to each other in distance. It classifies a data point by using the Euclidean distance formula to find the k nearest data points. Once these k data points have been found, kNN assigns the new data point the category held by the majority of those neighbors (Z. Zhang 2016). Figure 1 illustrates this methodology with two distinct classes, hearts and circles. The kNN algorithm is attempting to classify the mystery figure represented by the red square. The k parameter is set to k = 5, which means the algorithm uses the Euclidean distance formula to find the 5 nearest neighbors, illustrated by the green circle. From here the algorithm simply counts the number of neighbors from each class and assigns the majority class, which in this case is a heart.
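The whole procedure fits in a few lines of plain Python. The toy data below mimics Figure 1's two classes, with hearts clustered near one corner and circles near the other:

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """train: list of ((x, y), label) pairs. Classify query by majority
    vote among its k nearest neighbors (Euclidean distance)."""
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Toy version of Figure 1: hearts cluster near (1, 1), circles near (5, 5).
train = [((1, 1), "heart"), ((1, 2), "heart"), ((2, 1), "heart"),
         ((5, 5), "circle"), ((5, 6), "circle"), ((6, 5), "circle")]

print(knn_predict(train, query=(1.5, 1.5), k=5))  # "heart"
```

With k = 5 the query's neighbors are the three hearts plus two circles, so the majority vote returns "heart", exactly as in the figure.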

Figure 1

Classification process

The classification process has three distinct steps:

1. Distance calculation

The kNN first measures the distance between the data point it is trying to classify and all the training data points. Different distance calculation methods can be used, but the default and most commonly used method with kNN is the Euclidean distance formula (Theerthagiri, Ruby, and Vidya 2022):

\[ d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \]

Code
#Add points (X1, Y1) and (X2, Y2)
X1 <- 10; Y1 <- 12
X2 <- 14; Y2 <- 16

#Create the plot
plot(c(X1, X2), c(Y1, Y2), type = "n", xlab = "X-axis", ylab = "Y-axis",
     main = "Figure 2: Euclidean Distance",
     xlim = c(X1 - 4, X2 + 4), ylim = c(Y1 - 4, Y2 + 4))

#Plot first point
points(X1, Y1, col = "red", pch = 16, cex = 2)

#Plot second point
points(X2, Y2, col = "blue", pch = 16, cex = 2)

#Add horizontal leg
segments(X1, Y1, X2, Y1, col = "green", lwd = 2)

#Add vertical leg
segments(X2, Y1, X2, Y2, col = "green", lwd = 2)

#Add hypotenuse (the Euclidean distance)
segments(X1, Y1, X2, Y2, col = "purple", lwd = 2, lty = 2)

#Add labels
text(X1, Y1, labels = paste("(X1, Y1)\n(", X1, ",", Y1, ")"), pos = 2, col = "red", cex = 0.7)
text(X2, Y2, labels = paste("(X2, Y2)\n(", X2, ",", Y2, ")"), pos = 4, col = "blue", cex = 0.7)
text((X1 + X2) / 2 - 2, (Y1 + Y2) / 2 + 3, "Euclidean Distance (d)", col = "purple", font = 2, cex = 1.2)
arrows((X1 + X2) / 2, (Y1 + Y2) / 2 + 2, (X1 + X2) / 2, (Y1 + Y2) / 2, col = "purple", lwd = 2, length = 0.1)

#Insert formula
text(mean(c(X1, X2)), mean(c(Y1, Y2)) - 5,
     labels = expression(d == sqrt((14 - 10)^2 + (16 - 12)^2)),
     col = "black", cex = 0.9, font = 1)

Figure 2 shows the Euclidean distance formula, where \(X_2 - X_1\) calculates the horizontal difference and \(Y_2 - Y_1\) calculates the vertical difference. These two differences are squared to ensure they are positive regardless of direction; squaring also gives greater emphasis to larger distances.

2. Neighbor Selection

kNN allows the selection of a parameter k that tells the algorithm how many neighbors to use when classifying the unknown data point. The k parameter is very important: a k that is too large can bias classification toward the majority class, causing underfitting (Mucherino et al. 2009), while a k that is too small makes the algorithm overly sensitive to noise and outliers, causing overfitting. Studies recommend using cross-validation or heuristic methods, such as setting k to the square root of the dataset size, to determine an optimal value (Syriopoulos et al. 2023).
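Cross-validating k can be sketched with leave-one-out scoring on a toy dataset (illustrative only; a real study would use a proper train/test split and more candidate values):

```python
import math
from collections import Counter

def predict(train, query, k):
    """Majority vote among the k nearest (Euclidean) neighbors."""
    nbrs = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(lbl for _, lbl in nbrs).most_common(1)[0][0]

def loocv_accuracy(data, k):
    """Leave-one-out cross-validation accuracy for a given k."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict(rest, x, k) == y
    return hits / len(data)

# Two tight clusters of three points each.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
        ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]

best_k = max([1, 3, 5], key=lambda k: loocv_accuracy(data, k))
print(best_k)
```

On this toy data k = 5 scores poorly: removing any point leaves its own cluster in the minority among five neighbors, which is exactly the too-large-k failure mode described above.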

3. Classification decision based on majority voting

Once the k-nearest neighbors are identified, the algorithm assigns the new data point the most frequent class label among its neighbors. In cases of ties, distance-weighted voting can be applied, where closer neighbors have higher influence on the classification decision (Uddin et al. 2022).
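Distance-weighted voting can be sketched in a few lines: each neighbor votes with weight 1/distance, so a tie under plain majority voting is resolved in favor of the closer points. The labels below are hypothetical:

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, label). Each neighbor votes with
    weight 1/distance, so closer points count more and 50/50 ties break."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / max(dist, 1e-9)  # guard against zero distance
    return max(scores, key=scores.get)

# A 2-2 tie under plain majority voting, resolved by distance weighting:
tied = [(1.0, "diabetic"), (1.5, "diabetic"), (4.0, "healthy"), (5.0, "healthy")]
print(weighted_vote(tied))  # "diabetic" — the closer pair wins
```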

Assumptions

The kNN algorithm calculates the Euclidean distance between the unknown data point and the training data points because it assumes that similar data points lie in close proximity to each other, and that data points with similar features belong to the same class (Boateng, Otoo, and Abaye 2020).

Implementation of kNN

Code
library(DiagrammeR)

grViz("
digraph {
  graph [layout = dot, rankdir = LR, splines = true, size = 10]
  node [shape = box, style = rounded, fillcolor = lightblue, fontname = Arial, fontsize = 25, penwidth = 2]
  edge [color = '#8B814C', arrowhead = vee, penwidth = 2]  // edge style, declared before the edges it applies to
  
  A [label = '1. Load Required Libraries', width = 3, height = 1.5]
  B [label = '2. Import & Explore Dataset', width = 3, height = 1.5]
  C [label = '3. Is preprocessing required?', shape = circle, fillcolor = lightblue, width = 0.8, height = 0.8, fontsize = 25]
  D [label = '3a. Pre-Process the data', width = 3, height = 1.5]
  E [label = '4. Split Dataset into Training & Testing', width = 3, height = 1.5]
  F [label = '5. Hyperparameter tuning', width = 3, height = 1.5]
  G [label = '6. Train kNN Model', width = 3, height = 1.5]
  H [label = '7. Make Predictions', width = 3, height = 1.5]
  I [label = '8. Evaluate Model', width = 3, height = 1.5]
  
  A -> B
  B -> C
  C -> E [label = 'No', fontsize = 25]
  C -> D [label = 'Yes', fontsize = 25]
  D -> E
  E -> F
  F -> G
  G -> H
  H -> I
}
")

Pre-processing Data

Data must be prepared before implementing kNN. For the kNN algorithm to work we need to handle missing values, make all values numeric, and normalize or standardize the features. We also have the option of increasing accuracy by reducing dimensionality, removing correlated features, and fixing class imbalance if the data needs it.

  1. Handle missing values: kNN works by calculating the distance between data points, and missing values can skew the results. We must handle missing values by either imputing them or dropping them.
  2. Make all values numeric: kNN only handles numeric values, so all categorical values must be encoded using either one-hot encoding or label encoding.
  3. Normalize or standardize the features: We must normalize or standardize the features to reduce bias. We can use the min-max scaler or the standard scaler to do this.
  4. Reduce dimensionality: kNN can struggle to calculate distances when there are too many features. To solve this we can use Principal Component Analysis to reduce the number of features while keeping most of the variance.
  5. Remove correlated features: kNN works best when there are not too many features, so we can use a correlation matrix to see which features we can drop. For example, it can be good to drop features that have low variance or a high correlation (over 0.9), since these are redundant.
  6. Fix class imbalance: Class imbalance can lead to bias. We noticed a class imbalance in our dataset and chose to use the Synthetic Minority Over-sampling Technique (SMOTE) to handle it.
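Step 3 above is the easiest to sketch. This is min-max scaling in plain Python (scikit-learn's MinMaxScaler applies the same transform per column); the BMI values are drawn from the dataset's 12-98 range:

```python
def min_max_scale(column):
    """Rescale a numeric column to [0, 1] so that large-valued features
    (e.g. BMI) cannot dominate the distance calculation."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

bmi = [12, 25, 40, 98]  # raw BMI values spanning the dataset's range
print(min_max_scale(bmi))
```

After scaling, a one-unit difference in BMI and a one-unit difference in a 0/1 binary feature contribute comparably to the Euclidean distance.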

Hyperparameter Tuning

In order to increase the accuracy of the model there are a few parameters that we can adjust.

  1. Find the optimal k parameter: We can use grid search to find the best k.
  2. Change the distance metric: kNN uses the Euclidean distance by default, but we can use the Manhattan distance, the Minkowski distance, or another distance.
  3. Weights: kNN defaults to a “uniform” weight, giving the same weight to all neighbors, but it can be set to “distance” so that the closest neighbors carry more weight.
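The grid over these three settings is small enough to enumerate by hand; in scikit-learn the same dictionary could be handed to GridSearchCV with a KNeighborsClassifier. A sketch of the enumeration (the candidate values are illustrative):

```python
from itertools import product

# Hypothetical search grid over the three kNN settings discussed above.
param_grid = {
    "n_neighbors": [3, 5, 7, 11],
    "metric": ["euclidean", "manhattan", "minkowski"],
    "weights": ["uniform", "distance"],
}

# Every combination of the three settings: 4 * 3 * 2 = 24 candidates.
combos = [dict(zip(param_grid, values))
          for values in product(*param_grid.values())]
print(len(combos))  # 24
```

A grid search would fit and cross-validate a model for each of these 24 combinations and keep the best-scoring one.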

Advantages and Limitations

One advantage of kNN is that it is easy to understand and implement, and it maintains good accuracy even with noisy data (Syriopoulos et al. 2023). A serious limitation is its high computational cost: it needs a large amount of memory to calculate the distance between all the data points. kNN also has low accuracy on multidimensional data with irrelevant features (Saxena, Khan, and Singh 2014). Having to calculate the distance to every data point makes kNN slower as the number of data points grows large, as is the case with big data; kNN spends a significant amount of time calculating the distances between all the data points in a big file (Deng et al. 2016).

Analysis and Results

Data Exploration & Visualization

We explored the CDC Diabetes Health Indicators dataset, sourced from the UC Irvine Machine Learning Repository. It is a set of data that was gathered by the Centers for Disease Control and Prevention (CDC) through the Behavioral Risk Factor Surveillance System (BRFSS), which is one of the biggest continuous health surveys in the United States.

The BRFSS is an annual telephone survey that has been ongoing since 1984 and each year, more than 400,000 Americans respond to the survey. It provides important data on health behaviors, chronic diseases, and preventive health care use to help researchers and policymakers understand the health status and risks of the public.

Python and the ucimlrepo package were used to import the dataset directly from the UCI Machine Learning Repository, following the recommended instructions. This enabled us to easily save, prepare, and analyze the data for the current research.

Code
from ucimlrepo import fetch_ucirepo 

# Loading the dataset  
# fetch dataset 
cdc_diabetes_health_indicators = fetch_ucirepo(id=891) 
  
# data (as pandas dataframes) 
X = cdc_diabetes_health_indicators.data.features 
y = cdc_diabetes_health_indicators.data.targets 
  

Data Composition

The following table displays the first few rows of the CDC Diabetes Health Indicators dataset after importing it from UC Irvine Machine Learning Repository.

Code
library(knitr)
library(readr)

cdc_data_df <- read_csv("cdc_data.csv")

#kable(head(cdc_data_df))
kable(head(cdc_data_df), caption = "Table 1: CDC Diabetes Health Indicators dataset")
Table 1: CDC Diabetes Health Indicators dataset
HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income Diabetes_binary
1 1 1 40 1 0 0 0 0 1 0 1 0 5 18 15 1 0 9 4 3 0
0 0 0 25 1 0 0 1 0 0 0 0 1 3 0 0 0 0 7 6 1 0
1 1 1 28 0 0 0 0 1 0 0 1 1 5 30 30 1 0 9 4 8 0
1 0 1 27 0 0 0 1 1 1 0 1 0 2 0 0 0 0 11 3 6 0
1 1 1 24 0 0 0 1 1 1 0 1 0 2 3 0 0 0 11 5 4 0
1 1 1 25 1 0 0 1 1 1 0 1 0 2 0 2 0 1 10 6 8 0

Upon closer inspection we can see the dataset consists of 253,680 survey responses and contains 21 feature variables and 1 binary target variable named Diabetes_binary. Table 2 shows a detailed summary of the different variables representing demographic, behavioral, and health-related attributes and the range of values found in each variable.

Code
# Load packages
library(knitr)

# Create Table 
table_data <- data.frame(
  Type = c(
    "Target",
    "Binary", "", "", "", "", "", "", "", "", "", "", "", "", "",
    "Ordinal", "", "", "", "", "",
    "Continuous"
  ),
  Variable = c(
    "Diabetes_binary",
    "HighBP", "HighChol", "CholCheck", "Smoker", "Stroke", "HeartDiseaseorAttack", 
    "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump", "AnyHealthcare", 
    "NoDocbcCost", "DiffWalk", "Sex",
    "GenHlth", "MentHlth", "PhysHlth", "Age", "Education", "Income",
    "BMI"
  ),
  Description = c(
    "Indicates whether a person has diabetes",
    "High Blood Pressure", "High Cholesterol", "Cholesterol check in the last 5 years",
    "Smoked at least 100 cigarettes in lifetime", "Had a stroke", "History of heart disease or attack",
    "Engaged in physical activity in the last 30 days", "Regular fruit consumption", 
    "Regular vegetable consumption", "Heavy alcohol consumption", "Has health insurance or healthcare access",
    "Could not see a doctor due to cost", "Difficulty walking/climbing stairs", "Biological sex",
    "Self-reported general health (1=Excellent, 5=Poor)", 
    "Number of mentally unhealthy days in last 30 days", "Number of physically unhealthy days in last 30 days",
    "Age Groups (1 = 18-24, ..., 13 = 80+)", 
    "Highest education level (1 = No school, ..., 6 = College graduate)", 
    "Household income category (1 = <$10K, ..., 8 = $75K+)", 
    "Body Mass Index (BMI), measure of body fat"
  ),
  Range = c(
    "(0 = No, 1 = Yes)",
    "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)",
    "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", 
    "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", 
    "(0 = No, 1 = Yes)", "(0 = Female, 1 = Male)",
    "(1 = Excellent, ..., 5 = Poor)", "(0 - 30)", "(0 - 30)", 
    "(1 = 18-24, ..., 13 = 80+)", "(1 = No school, ..., 6 = College grad)", 
    "(1 = <$10K, ..., 8 = $75K+)", "(12 - 98)"
  )
)

# Print Table 
kable(table_data, caption = "Table 2. Summary of 22 Variables", align = "l")
Table 2. Summary of 22 Variables
Type Variable Description Range
Target Diabetes_binary Indicates whether a person has diabetes (0 = No, 1 = Yes)
Binary HighBP High Blood Pressure (0 = No, 1 = Yes)
HighChol High Cholesterol (0 = No, 1 = Yes)
CholCheck Cholesterol check in the last 5 years (0 = No, 1 = Yes)
Smoker Smoked at least 100 cigarettes in lifetime (0 = No, 1 = Yes)
Stroke Had a stroke (0 = No, 1 = Yes)
HeartDiseaseorAttack History of heart disease or attack (0 = No, 1 = Yes)
PhysActivity Engaged in physical activity in the last 30 days (0 = No, 1 = Yes)
Fruits Regular fruit consumption (0 = No, 1 = Yes)
Veggies Regular vegetable consumption (0 = No, 1 = Yes)
HvyAlcoholConsump Heavy alcohol consumption (0 = No, 1 = Yes)
AnyHealthcare Has health insurance or healthcare access (0 = No, 1 = Yes)
NoDocbcCost Could not see a doctor due to cost (0 = No, 1 = Yes)
DiffWalk Difficulty walking/climbing stairs (0 = No, 1 = Yes)
Sex Biological sex (0 = Female, 1 = Male)
Ordinal GenHlth Self-reported general health (1=Excellent, 5=Poor) (1 = Excellent, …, 5 = Poor)
MentHlth Number of mentally unhealthy days in last 30 days (0 - 30)
PhysHlth Number of physically unhealthy days in last 30 days (0 - 30)
Age Age Groups (1 = 18-24, …, 13 = 80+) (1 = 18-24, …, 13 = 80+)
Education Highest education level (1 = No school, …, 6 = College graduate) (1 = No school, …, 6 = College grad)
Income Household income category (1 = <$10K, …, 8 = $75K+) (1 = <$10K, …, 8 = $75K+)
Continuous BMI Body Mass Index (BMI), measure of body fat (12 - 98)

This dataset provides a large-scale representation of diabetes-related risk factors, making it valuable for exploratory data analysis, statistical modeling, and machine learning applications aimed at improving diabetes risk assessment and prevention strategies.

Data Integrity Assessment

In this step, we checked for null values, missing data (NaNs), and duplicate rows to ensure data integrity. Additionally, we identified columns with invalid values such as strings with spaces in numeric fields.

Code

import pandas as pd
from ucimlrepo import fetch_ucirepo 
  
# dataset 
cdc_diabetes_health_indicators = fetch_ucirepo(id=891) 
  
# data (as pandas dataframes) 
X = cdc_diabetes_health_indicators.data.features 
y = cdc_diabetes_health_indicators.data.targets 

cdc_data_df = pd.concat([cdc_diabetes_health_indicators.data.features, 
                         cdc_diabetes_health_indicators.data.targets], axis=1)
                         
exploratory_data_analysis = {
    "Exploratory Data Analysis": ["Number of Nulls", "Missing Data", "Duplicate Rows", "Total Rows"],
    "Count": [cdc_data_df.isna().sum().sum(),
              (cdc_data_df == " ").sum().sum(),
              cdc_data_df.duplicated().sum(),
              cdc_data_df.shape[0]],
}

exploratory_data_analysis_df=pd.DataFrame(exploratory_data_analysis)

exploratory_data_analysis_df.to_csv("eda.csv", index=False)
Code
library(knitr)
library(readr)

# Load the dataset
exploratory_df <- read_csv("eda.csv")

# Print table
kable(exploratory_df, caption = "Table 3: Data Integrity Report")
Table 3: Data Integrity Report
Exploratory Data Analysis Count
Number of Nulls 0
Missing Data 0
Duplicate Rows 24206
Total Rows 253680

Key Findings:

There are no missing values, meaning no imputation is needed.

24,206 duplicate records were detected, which need to be analyzed to determine whether they require removal or weighting to prevent redundancy in model training.
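If removal turns out to be appropriate, the pandas check is a single call. A minimal sketch on a toy frame standing in for the survey data (two of the three rows are identical):

```python
import pandas as pd

# Toy frame: rows 0 and 1 are exact duplicates.
df = pd.DataFrame({"HighBP": [1, 1, 0],
                   "BMI": [40, 40, 25],
                   "Diabetes_binary": [0, 0, 0]})

deduped = df.drop_duplicates()  # keeps the first copy of each duplicate row
print(len(df), len(deduped))    # 3 2
```

On the full dataset the same call would drop the 24,206 repeated survey responses, leaving one copy of each.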

Statistical Summary

A summary of the dataset’s key statistical properties provides insights into central tendencies, variability, and distribution patterns. This analysis helps identify potential imbalances, outliers, and preprocessing needs, such as scaling or encoding, to ensure optimal model performance.

Figure 4 shows a graph of the mean of different features in the data. It shows BMI, a continuous variable indicating body mass index, and the 6 ordinal variables, which include demographics such as age, income, and education as well as the self-reported health statuses GenHlth, MentHlth, and PhysHlth.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo

# Import dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)

# Combine features and targets into one DataFrame
cdc_data_df = pd.concat([
    cdc_diabetes_health_indicators.data.features,
    cdc_diabetes_health_indicators.data.targets
], axis=1)

# Exclude binary variables
ord_variables = ['GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income', 'BMI']

# Calculate means 
mean_values = cdc_data_df[ord_variables].mean().sort_values(ascending=False)

# Create plot 
plt.figure(figsize=(10, 6))
sns.barplot(x=mean_values.values, y=mean_values.index, palette="plasma")
plt.title("Figure 4: Mean Values of Ordinal and Continuous Variables")
plt.xlabel("Mean")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()

From this graph we can quickly see that BMI and age are the features with the highest means, which can distort the kNN's results by dominating the distance metric. Conversely, GenHlth is the feature with the lowest mean, which can cause the kNN to effectively ignore it even if it is an important predictor. A good way to combat this problem is to standardize or scale the features.
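To illustrate (with made-up numbers, not the survey data), z-score standardization puts a wide-ranging feature like BMI and a narrow 1-to-5 rating like GenHlth on comparable scales, so neither dominates the Euclidean distance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four hypothetical patients: [BMI, GenHlth (1-5)]
X = np.array([[22.0, 2], [40.0, 3], [28.0, 5], [31.0, 1]], dtype=float)

# Raw Euclidean distance between patients 0 and 1 is dominated by the BMI gap
raw_dist = np.linalg.norm(X[0] - X[1])

# After standardization each feature has mean 0 and unit variance,
# so both features contribute on the same scale
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(f"raw distance: {raw_dist:.2f}, scaled distance: {scaled_dist:.2f}")
```

After scaling, a one-unit change in GenHlth carries roughly the same weight in the distance as a one-standard-deviation change in BMI.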

Similarly, outliers can skew the kNN by increasing the average distances that are calculated. Figure 5 shows a box-and-whisker plot of the same variables as above, highlighting the outliers.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo

# Import dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)

# Combine features and targets into one DataFrame
cdc_data_df = pd.concat([
    cdc_diabetes_health_indicators.data.features,
    cdc_diabetes_health_indicators.data.targets
], axis=1)

# Exclude binary columns
ord_variables = ['GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income', 'BMI']

# Create boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(data=cdc_data_df[ord_variables], orient="h", palette="Set2")
plt.title("Figure 5: Boxplot showing Outliers for Ordinal and Continuous Variables")
plt.xlabel("Value")
plt.tight_layout()
plt.show()

This boxplot shows that BMI, MentHlth, and PhysHlth have many outliers on the upper end of the distribution. Although the median does not seem to be affected, these outliers can inflate the averages of these features, which can distort the distance calculation in the kNN. Some effective methods for dealing with outliers are to remove them if the dataset is large enough, to use a robust scaler to reduce their influence, or to use a weighted kNN. A weighted kNN gives higher weight to closer neighbors than to distant outliers.
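As an illustration of the robust-scaling option (with made-up BMI values, not the survey data), RobustScaler centers on the median and scales by the interquartile range, so extreme values distort the scale far less than with the mean/standard-deviation scaling of StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical BMI column with a couple of extreme outliers
bmi = np.array([[21], [24], [26], [28], [30], [33], [72], [85]], dtype=float)

# StandardScaler uses the mean and std, which the outliers inflate;
# RobustScaler centers on the median and scales by the IQR instead
std_scaled = StandardScaler().fit_transform(bmi)
rob_scaled = RobustScaler().fit_transform(bmi)

# A typical value (BMI 28, near the median) stays close to 0 under RobustScaler
print(std_scaled[3], rob_scaled[3])
```

The weighted-kNN option is a one-argument change in scikit-learn: `KNeighborsClassifier(weights='distance')`.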

Next, we will take a look at the binary features. Figure 6 shows us the balance between classes 0 and 1.

Code
import pandas as pd
import matplotlib.pyplot as plt

# Filter binary columns 
binary_cols = [col for col in cdc_data_df.columns 
               if set(cdc_data_df[col].dropna().unique()).issubset({0, 1})]

# Count 0s and 1s
binary_counts = pd.DataFrame({
    '0': (cdc_data_df[binary_cols] == 0).sum(),
    '1': (cdc_data_df[binary_cols] == 1).sum()
})

# Plot
binary_counts.plot(kind='barh', stacked=True, figsize=(10, 8), colormap='Set2')
plt.title("Figure 6: Binary Feature Distribution of 0 and 1")
plt.xlabel("Count")
plt.ylabel("Feature")
plt.legend(title="Class")
plt.tight_layout()
plt.show()

In Figure 6 we can see there is an imbalance in our target variable Diabetes_binary. Only 13.9% of people have diabetes or prediabetes, denoted by a 1. This imbalance can lead to biased model predictions, favoring the dominant class while under-detecting diabetes cases. To address this, techniques such as oversampling (SMOTE) or undersampling should be considered to improve classification performance.

Correlation Analysis

A correlation heatmap was generated in Figure 7 to examine relationships between variables. The correlation heatmap helps identify strongly correlated features, which may lead to redundancy in the model.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Compute correlation matrix 
corr_matrix = cdc_data_df.corr()

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5, vmin=-1, vmax=1)
plt.title("Figure 7: Feature Correlation Heatmap")
plt.show()

Positive Correlations:

General Health (GenHlth) is strongly correlated with Physical Health (PhysHlth) (0.52) and Difficulty Walking (DiffWalk) (0.45).

As individuals report poorer general health, they experience more physical health issues and mobility limitations.

Physical Health (PhysHlth) and Difficulty Walking (DiffWalk) (0.47) show a strong link. Those with more days of poor physical health are likely to struggle with mobility.

Age correlates with High Blood Pressure (0.34) and High Cholesterol (0.27), indicating an increased risk of cardiovascular conditions as people get older.

Mental Health (MentHlth) and Physical Health (PhysHlth) (0.34) are positively associated. Worsening mental health often coincides with physical health problems.

Negative Correlations:

Higher Income is associated with better General Health (-0.33), fewer Mobility Issues (-0.30), and better Physical Health (-0.24).

This suggests financial stability improves access to healthcare and promotes a healthier lifestyle.

Higher Education is linked to better General Health (-0.28) and Mental Health (-0.19). Educated individuals may have better health awareness and coping strategies.

The heatmap confirms well-known health trends: age, high blood pressure, and cholesterol are major risk factors for diabetes. Poor physical and mental health are strongly related, and socioeconomic status (income, education) plays a key role in overall health. These insights highlight the importance of early intervention strategies and lifestyle modifications to prevent chronic diseases like diabetes.

Since we have no correlation over 0.5, that means multicollinearity is not a major issue, and we don’t need to remove any variables.
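This multicollinearity check can also be automated. The sketch below uses a small synthetic frame (the real analysis would run on cdc_data_df) and lists any feature pairs whose absolute correlation exceeds 0.5:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the survey data: "c" is built from "a", so they correlate strongly
df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
df["c"] = df["a"] * 0.9 + rng.normal(scale=0.3, size=200)

corr = df.corr().abs()
# Keep the upper triangle only, so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [(r, c) for r in upper.index for c in upper.columns
              if upper.loc[r, c] > 0.5]
print(high_pairs)
```

On the real data this list would come back empty, confirming that no pair of features needs to be dropped.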

BMI Distribution and Density Analysis by Diabetes Status

BMI is a known risk factor for diabetes, and the analysis confirms that individuals with diabetes tend to have slightly higher BMI values on average. The KDE (Kernel Density Estimate) plot visualizes the distribution of BMI values for individuals with and without diabetes (or prediabetes).

Code
import matplotlib.pyplot as plt
import seaborn as sns

# Define the target variable
target_variable = "Diabetes_binary"

# Ensure that target_variable is in the dataframe
if target_variable in cdc_data_df.columns:
    plt.figure(figsize=(8, 5))
    sns.boxplot(x=target_variable, y="BMI", data=cdc_data_df, 
                hue=target_variable, palette="Set3", legend=False)
    
    plt.title("Figure 8: BMI Distribution by Diabetes Status")
    plt.xlabel("Diabetes Status (0 = No, 1 = Diabetes/Prediabetes)")
    plt.ylabel("BMI")
    plt.show()
else:
    print("Target variable not found in dataset.")

Code
    
# Set figure size
plt.figure(figsize=(10, 6))

# KDE plot for BMI distribution by diabetes status
sns.kdeplot(data=cdc_data_df[cdc_data_df['Diabetes_binary'] == 0]['BMI'], 
            label='No Diabetes (0)', color="mediumaquamarine", fill=True)

sns.kdeplot(data=cdc_data_df[cdc_data_df['Diabetes_binary'] == 1]['BMI'], 
            label='Diabetes/Prediabetes (1)', color="salmon", fill=True)

# Titles and labels
plt.title('Figure 9: BMI Density by Diabetes Status', fontsize=16)
plt.xlabel('BMI', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.legend(title='Diabetes Status')

# Show plot
plt.show()

The analysis of BMI distribution across individuals with and without diabetes highlights some key trends:

1. General BMI Trends:

Diabetic individuals tend to have a slightly higher median BMI compared to non-diabetic individuals.

A significant portion of diabetic individuals have a BMI above 30, aligning with known research that obesity is a major risk factor for diabetes.

2. Presence of Outliers:

Both groups contain extreme BMI values, particularly in the severely obese range.

These extreme values may disproportionately affect model performance and should be further investigated.

If necessary, outlier removal or transformation techniques (e.g., log transformation, winsorization) could be applied to maintain dataset balance.

3. BMI and Diabetes Relationship:

The KDE density plot reveals that while individuals with diabetes generally have higher BMI values, the overall distribution still shows overlap between the two groups.

The density shift towards higher BMI values (above 30) for diabetic individuals suggests an association between obesity and diabetes risk.

4. Overlap Between Groups:

Despite the observed trends, BMI alone does not serve as a strong distinguishing factor for diabetes, as there is a significant overlap in distributions.

Other factors, such as age, cholesterol levels, and physical activity, should also be considered to improve the predictive accuracy of diabetes risk assessment.
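The winsorization mentioned in point 2 caps extreme values at chosen percentiles instead of dropping rows; a minimal sketch on hypothetical BMI values (not the survey data):

```python
import numpy as np

# Hypothetical BMI values with extreme highs
bmi = np.array([19, 22, 25, 27, 29, 31, 34, 38, 74, 92], dtype=float)

# Winsorize: cap values at the 5th and 95th percentiles
lo, hi = np.percentile(bmi, [5, 95])
bmi_wins = np.clip(bmi, lo, hi)

# A log transform is an alternative that compresses the right tail
bmi_log = np.log(bmi)

print(bmi.max(), bmi_wins.max())  # the cap shrinks the largest value
```

Unlike outright removal, winsorizing keeps the row count (and hence the class balance) unchanged.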

Key Findings from Data Exploration and Visualizations:

Class Imbalance:

Only 13.9% of people have diabetes, which suggests an imbalance in the target variable. This may require oversampling (SMOTE) or class weighting when training models.

BMI and High Blood Pressure are Major Health Concerns:

  • The average BMI is 28.38, close to the overweight range.
  • 43% of the population has high blood pressure, which is a known risk factor for diabetes.

Physical Activity and Diet Indicators:

  • 75% of individuals engage in regular physical activity.
  • 81% eat vegetables regularly, and 63% eat fruits regularly, suggesting generally healthy dietary habits.

Age and Income Influence Health Outcomes:

  • Older individuals are more likely to develop diabetes.
  • Higher income groups tend to report better health, which may correlate with healthcare access.

The next phase involves data preprocessing, feature selection, and model development to enhance predictive performance.

Modeling and Results

Models

We chose to create four classification kNN models to illustrate the methodology. Model 1 will be our baseline model. Model 2 shows us the impact of changing the k value on the classification model. Model 3 shows us how adjusting the class imbalance affects the model. Model 4 shows us how changing the decision threshold affects the classification of diabetes.

Table 4 shows us a summary of the models.

Code
#Load gt package
library(gt)

#Create dataframe
model_sum <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3", "Model 4"),
  k = c(5, 15, 15, 15),
  Weights = c("'uniform'", "'uniform'", "'distance'", "'uniform'"),
  Distance = rep("Euclidean", 4),
  SMOTE = c("No", "No", "Yes", "Yes"),
  Decision_Threshold = c(0.5, 0.5, 0.5, 0.2)
)

# Create Table 
model_sum %>%
  gt() %>%
  tab_header(title = md("**Table 4: Model Summary**")) %>%
  cols_label(
    Model = "Model Name",
    k = "k value",
    Weights = "Weights",
    Distance = "Distance",
    SMOTE = "SMOTE",
    Decision_Threshold = "Decision Threshold"
  ) %>%
  cols_align(align = "center", columns = everything()) %>%
  tab_options(
    table.border.top.width = px(2),
    table.border.bottom.width = px(2),
    heading.align = "left"
  )
Table 4: Model Summary
Model Name k value Weights Distance SMOTE Decision Threshold
Model 1 5 'uniform' Euclidean No 0.5
Model 2 15 'uniform' Euclidean No 0.5
Model 3 15 'distance' Euclidean Yes 0.5
Model 4 15 'uniform' Euclidean Yes 0.2
Data Preprocessing

In our analysis of the data we saw that there was no missing data, so we did not have to remove or impute any values. We did, however, find that there were duplicate values, so we started cleaning the data by dropping these duplicates. Before removing the duplicates, the dataset contained 253,680 survey responses, with an 86.07% majority class (No Diabetes) and a 13.93% minority class (Diabetes/Prediabetes). After removing duplicate rows, the total dataset size decreased, shifting the class distribution slightly to 84.70% (No Diabetes) and 15.29% (Diabetes/Prediabetes). This change occurred because the duplicate entries were not evenly distributed across the classes: more duplicates existed in the majority class, so removing them slightly increased the proportion of the minority class. This step ensured a cleaner dataset while preserving meaningful class representation for further analysis.

The data consisted mostly of numeric features: binary features and the ordinal categorical variables age, education, income, and GenHlth. We kept the ordinal variables as they are, since they have a meaningful natural order that provides the kNN with meaningful distances. Next, we divided the data into training and testing sets, using test_size=0.2 to reserve 80% of the data for training the kNN and 20% for testing. Finally, our exploratory analysis showed that the features needed to be standardized or normalized; we chose standardization so that BMI and age would be on the same scale as the other features. You can view the preprocessing code for Models 1 and 2 below.

Code
# Pre-processing for Model 1 and Model 2

# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo 

# Load the dataset  
cdc_diabetes_health_indicators = fetch_ucirepo(id=891) 

# Create dataframe by combining features and targets
cdc_data_df = pd.concat(
    [cdc_diabetes_health_indicators.data.features, 
     cdc_diabetes_health_indicators.data.targets],
    axis=1
)

# Drop duplicate rows
cdc_data_df.drop_duplicates(inplace=True)

# Separate features and target 
X = cdc_data_df.drop(columns="Diabetes_binary")
y = cdc_data_df["Diabetes_binary"]

# Split training and testing data with an 80/20 mix
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

For Models 3 and 4 we chose to apply SMOTE to deal with the class imbalance, because there are relatively few diabetes samples in the data and that imbalance might affect the reliability of the results.

Code
#Extra pre-processing step for Model 3 & 4 ONLY
from imblearn.over_sampling import SMOTE

# Apply SMOTE to training data to use for Model 3 & 4
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

After applying SMOTE, the class distribution of the target variable Diabetes_binary became balanced, with equal representation of both classes. This prevents the model from being biased towards the majority class and ensures better learning from the minority class. In theory, the balanced dataset should improve model performance, particularly in recall and overall classification metrics.
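The core idea behind SMOTE — creating synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbors — can be sketched in a few lines of numpy (a toy illustration of the principle, not the imblearn implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
# Toy minority class: 10 points in 2-D
X_min = rng.normal(loc=5.0, size=(10, 2))

# Find each point's nearest minority-class neighbors (column 0 is the point itself)
nn = NearestNeighbors(n_neighbors=3).fit(X_min)
_, idx = nn.kneighbors(X_min)

synthetic = []
for i in range(len(X_min)):
    j = rng.choice(idx[i][1:])   # pick a random minority neighbor
    gap = rng.random()           # random point along the connecting segment
    synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
synthetic = np.array(synthetic)
# Each synthetic point lies on the segment between a real point and its neighbor
```

Because the new points are interpolations of existing minority samples, they enlarge the minority class without simply duplicating rows.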

Training the kNN Models
Model 1

Model 1 is our baseline model and has an arbitrary k value of 5. It uses the Euclidean distance to measure the distances to points in the training dataset and assigns a "uniform" weight to all neighbors, meaning they all have equal importance. This model is trained on the imbalanced dataset.
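The mechanics of this prediction — Euclidean distances followed by a uniform majority vote among the k nearest training points — can be checked by hand on toy data (a self-contained sketch, not the survey data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two clusters with labels 0 and 1
X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
query = np.array([[2.0, 2.0]])

# By hand: Euclidean distances, take the 5 nearest, majority vote
dists = np.linalg.norm(X_train - query, axis=1)
nearest5 = np.argsort(dists)[:5]
vote = np.bincount(y_train[nearest5]).argmax()

# sklearn's classifier makes the same call
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='euclidean')
knn.fit(X_train, y_train)
assert vote == knn.predict(query)[0]  # both predict class 0
```

With k = 5, the vote here is three class-0 neighbors against two class-1 neighbors, so the query is labeled 0.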

Code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score, confusion_matrix 

# Model 1: 
knn_model_1 = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='euclidean')
knn_model_1.fit(X_train_scaled, y_train)
KNeighborsClassifier(metric='euclidean')
Code
y_pred1 = knn_model_1.predict(X_test_scaled)
print(classification_report(y_test, y_pred1))
              precision    recall  f1-score   support

           0       0.87      0.94      0.91     38876
           1       0.41      0.21      0.28      7019

    accuracy                           0.83     45895
   macro avg       0.64      0.58      0.59     45895
weighted avg       0.80      0.83      0.81     45895
Code
print(confusion_matrix(y_test, y_pred1))
[[36716  2160]
 [ 5539  1480]]
Code
#Grid Search 

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, make_scorer, recall_score

# Parameters we will test
knn_params = {
    'n_neighbors': [3, 5, 15, 20, 45],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'minkowski']
}

# optimize recall for diabetes cases 
recall_class1 = make_scorer(recall_score, pos_label=1)

# Initialize kNN model
knn = KNeighborsClassifier()

# Grid search
grid_search = GridSearchCV(
    estimator=knn,
    param_grid=knn_params,
    scoring=recall_class1,
    cv=5,
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train_smote, y_train_smote)
GridSearchCV(cv=5, estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'metric': ['euclidean', 'manhattan', 'minkowski'],
                         'n_neighbors': [3, 5, 15, 20, 45],
                         'weights': ['uniform', 'distance']},
             scoring=make_scorer(recall_score, response_method='predict', pos_label=1),
             verbose=1)
Code
#Print best parameters
print("Optimal parameters", grid_search.best_params_)
Optimal parameters {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'distance'}
Model 2

We used grid search to explore the hyperparameters for our dataset and chose to increase the k value for Model 2 to 15. In theory, adjusting the k value should improve accuracy because the model now takes a majority vote among the closest 15 neighbors; if k = 5 somehow missed an important neighbor, a larger k should remedy this. We are still using the Euclidean distance and "uniform" weights, and the target variable is still imbalanced.

Code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score, confusion_matrix

# Model 2: 
knn_model_2 = KNeighborsClassifier(n_neighbors=15, weights='uniform', metric='euclidean')
knn_model_2.fit(X_train_scaled, y_train)
KNeighborsClassifier(metric='euclidean', n_neighbors=15)
Code
y_pred2 = knn_model_2.predict(X_test_scaled)
print(classification_report(y_test, y_pred2))
              precision    recall  f1-score   support

           0       0.86      0.97      0.91     38876
           1       0.48      0.15      0.22      7019

    accuracy                           0.85     45895
   macro avg       0.67      0.56      0.57     45895
weighted avg       0.81      0.85      0.81     45895
Code
print(confusion_matrix(y_test, y_pred2))
[[37785  1091]
 [ 5997  1022]]
Model 3

Model 3 shows how using SMOTE to balance the target variable affects the model. It uses the same k value as Model 2 but uses "distance" weights instead of "uniform" weights; distance weighting gives more influence to the closest neighbors.
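Under "distance" weighting each neighbor's vote is scaled by the inverse of its distance, so one very close neighbor can outvote several far ones. A toy sketch with hypothetical distances and labels:

```python
import numpy as np

# Distances and labels of k = 5 hypothetical neighbors:
# one very close class-1 neighbor, four distant class-0 neighbors
dists = np.array([0.1, 2.0, 2.1, 2.2, 2.3])
labels = np.array([1, 0, 0, 0, 0])

# Uniform vote: class 0 wins 4 to 1
uniform_winner = np.bincount(labels).argmax()

# Distance-weighted vote: each neighbor contributes 1/distance to its class
weights = 1.0 / dists
class_scores = np.array([weights[labels == c].sum() for c in (0, 1)])
weighted_winner = class_scores.argmax()

print(uniform_winner, weighted_winner)  # 0 under uniform, 1 under distance weights
```

This is why distance weighting can dampen the pull of distant points (including outliers) on the classification.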

Code
# Model 3:
knn_model_3 = KNeighborsClassifier(n_neighbors=15, weights='distance', metric='euclidean')
knn_model_3.fit(X_train_smote, y_train_smote)
KNeighborsClassifier(metric='euclidean', n_neighbors=15, weights='distance')
Code
y_pred_3 = knn_model_3.predict(X_test_scaled)
print(classification_report(y_test, y_pred_3))
              precision    recall  f1-score   support

           0       0.92      0.67      0.78     38876
           1       0.28      0.70      0.40      7019

    accuracy                           0.68     45895
   macro avg       0.60      0.69      0.59     45895
weighted avg       0.83      0.68      0.72     45895
Code
print(confusion_matrix(y_test, y_pred_3))
[[26217 12659]
 [ 2135  4884]]
Model 4

Model 4 attempts to increase the recall of the model. It addresses the imbalanced dataset by using SMOTE and by adjusting the decision threshold from the default of 0.5 to 0.2. We made this change because research suggests that a decision threshold of 0.5 is not optimal for classification models trained on imbalanced data (Esposito et al. 2021).

Code
#Model 4

knn_model_4 = KNeighborsClassifier(n_neighbors=15, weights='uniform', metric='euclidean')
knn_model_4.fit(X_train_smote, y_train_smote)
KNeighborsClassifier(metric='euclidean', n_neighbors=15)
Code
# Adjust decision threshold
y_prob_4 = knn_model_4.predict_proba(X_test_scaled)[:, 1]
threshold = 0.2
y_pred_4 = (y_prob_4 >= threshold).astype(int)
print(classification_report(y_test, y_pred_4))
              precision    recall  f1-score   support

           0       0.96      0.40      0.56     38876
           1       0.21      0.92      0.35      7019

    accuracy                           0.48     45895
   macro avg       0.59      0.66      0.45     45895
weighted avg       0.85      0.48      0.53     45895
Code
print(confusion_matrix(y_test, y_pred_4))
[[15379 23497]
 [  588  6431]]
Evaluating and Comparing the models

Now that the four models have been trained we can focus on evaluating them. Table 5 below shows the summary of the four models.

Code
# import libraries
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

# Create dataframe 
results = pd.DataFrame({
    'Model': ['Model 1', 'Model 2', 'Model 3', 'Model 4'],'k': [5, 15, 15, 15],
    'Weight': ['Uniform', 'Uniform', 'Distance', 'Uniform' ],
    'SMOTE': ['No', 'No', 'Yes', 'Yes'],
    'Decision Threshold': ['0.5', '0.5', '0.5', '0.2'],
    'Accuracy': [
        accuracy_score(y_test, y_pred1),
        accuracy_score(y_test, y_pred2),
        accuracy_score(y_test, y_pred_3),
        accuracy_score(y_test, y_pred_4)
    ],
    'F1 Score': [
        f1_score(y_test, y_pred1),
        f1_score(y_test, y_pred2),
        f1_score(y_test, y_pred_3),
        f1_score(y_test, y_pred_4)
    ],
    'Precision': [
        precision_score(y_test, y_pred1),
        precision_score(y_test, y_pred2),
        precision_score(y_test, y_pred_3),
        precision_score(y_test, y_pred_4)
    ],
    'Recall': [
        recall_score(y_test, y_pred1),
        recall_score(y_test, y_pred2),
        recall_score(y_test, y_pred_3),
        recall_score(y_test, y_pred_4)
    ],
    'ROC AUC': [
        roc_auc_score(y_test, knn_model_1.predict_proba(X_test_scaled)[:, 1]),
        roc_auc_score(y_test, knn_model_2.predict_proba(X_test_scaled)[:, 1]),
        roc_auc_score(y_test, knn_model_3.predict_proba(X_test_scaled)[:, 1]),
        roc_auc_score(y_test, knn_model_4.predict_proba(X_test_scaled)[:, 1])
        
    ]
})

# Create the table 
results.style \
    .set_caption("KNN Model Performance Summary") \
    .format({
        "Accuracy": "{:.2%}",
        "F1 Score": "{:.2%}",
        "Precision": "{:.2%}",
        "Recall": "{:.2%}",
        "ROC AUC": "{:.2f}"
    }) \
    .set_properties(**{'text-align': 'center'}) \
    .set_table_styles([
        {'selector': 'caption', 'props': [('caption-side', 'top'), ('font-weight', 'bold'), ('font-size', '16px')]},
        {'selector': 'th', 'props': [('background-color', '#f2f2f2'), ('font-size', '14px')]}
    ])
Table 1: KNN Model Performance Summary
  Model k Weight SMOTE Decision Threshold Accuracy F1 Score Precision Recall ROC AUC
0 Model 1 5 Uniform No 0.5 83.22% 27.77% 40.66% 21.09% 0.71
1 Model 2 15 Uniform No 0.5 84.56% 22.38% 48.37% 14.56% 0.77
2 Model 3 15 Distance Yes 0.5 67.77% 39.77% 27.84% 69.58% 0.74
3 Model 4 15 Uniform Yes 0.2 47.52% 34.81% 21.49% 91.62% 0.75

Results

We evaluated the models based on accuracy, F1 score, precision, recall and ROC AUC score. We paid particular attention to accuracy and recall and tried to find the best balance between these two evaluation metrics.

Model 1

Model 1 is our baseline model with a small k value of 5. It achieved 83.22% accuracy, meaning it was correct 83.22% of the time when classifying both positive and negative diabetes cases. The F1 score was a very low 27.77%, so the model struggles to balance precision and recall, which makes sense given that it was trained on an imbalanced dataset. Precision was 40.66%, meaning only about 40.66% of the positive predictions were actually positive, so there are many false positives. Recall was 21.09%, meaning that of all the true positives in the test set, the model detected only 21.09%. Its ROC AUC score of 0.71 means it can distinguish between the classes about 71% of the time. In summary, Model 1 is accurate at classifying non-diabetic cases but very bad at detecting diabetes.

Model 2

Model 2 is the same as the baseline Model 1 except that the k parameter was increased to 15. This raised the accuracy to 84.56%, the highest of the four models, but the accuracy is high only because, like Model 1, it is good at detecting the non-diabetic majority class. The F1 score is 22.38%, so it is even worse than Model 1 at balancing precision and recall. It does have higher precision than Model 1, so it produces fewer false positives, but recall fell to 14.56%, meaning it misses most of the positive diabetes cases. Model 2 also has the highest ROC AUC score at 0.77, making it the best model at separating the classes overall. Since the purpose of using the kNN is to detect or screen for diabetes, we would not want to use this model.

Model 3

Model 3 also uses k = 15 but changes the weights from "uniform" to "distance", which gives more influence to the closest neighbors when classifying an unknown data point, and it uses SMOTE to balance the dataset. These two changes increased recall to 69.58% but decreased overall accuracy to 67.77%. This means Model 3 correctly identified about 70% of the positive diabetes cases while keeping accuracy at about 68%. It also has an improved F1 score of 39.77%, making it better than the first two models at balancing precision and recall. Precision dropped to 27.84%, so this model produces more false positives, and the ROC AUC score dropped to 0.74, meaning it is still decent at separating the classes. Model 3 is a better choice for classifying diabetes, since its recall and accuracy are more balanced and it catches about 70% of diabetes cases.

Model 4

Model 4 sets the weights back to "uniform", giving equal influence to all neighbors regardless of distance. We also decreased the decision threshold to 0.2 to compensate for the imbalanced dataset. Reducing the threshold drastically increased recall, with Model 4 classifying 91.62% of the diabetes cases correctly; however, this came at the expense of overall accuracy, which fell to 47.52%, meaning many non-diabetic cases are classified as having diabetes. The ROC AUC score of 0.75 is comparable to Model 3's. The F1 score fell slightly to 34.81% but is still decent. Precision is very low at 21.49%, so we are getting many false positives; however, this model could still serve as a screening tool, since people can have an obese BMI, a sedentary lifestyle, low vegetable consumption, and high caloric intake for years before they actually develop diabetes. This model would be good at screening people at high risk of developing diabetes.

Conclusion

kNNs are simple but powerful algorithms that can be used for regression and classification problems. In classification, a kNN computes the distance between data points using the distance metric the user specifies and then classifies the unknown data point by a majority vote among its k nearest neighbors, where k is also specified by the user. This algorithm has proven useful in many industries, including healthcare. In this project we created four kNN models trained to classify unknown data points into diabetes and non-diabetes classes using the CDC Diabetes Health Indicators dataset from the UC Irvine Machine Learning Repository. We saw how fine-tuning a kNN model can help classify diabetes in a healthcare or public health setting. Model 3 is our most balanced model and shows potential for classifying diabetic cases, but it would need further improvement before use in a healthcare setting. Model 4 shows the highest potential for a public health setting, where it could be used to screen people at high risk of developing diabetes: it detects confirmed diabetes cases with 91.62% recall, which is very useful in public health, where many interventions come from identifying people who might be at risk for diabetes. Model 4 could fit into a public health program such as WIC, where participants answer health and nutrition questions as part of their participation and a counselor tailors nutrition education and offers referrals based on the results. It could identify people at high risk of developing diabetes and lead them to be referred to a lifestyle modification class, or for biochemical testing to confirm whether they have diabetes.

References

Ali, AMEER, MOHAMMED Alrubei, LF Mohammed Hassan, M Al-Ja’afari, and Saif Abdulwahed. 2020. “Diabetes Classification Based on KNN.” IIUM Engineering Journal 21 (1): 175–81.
Altamimi, Abdulaziz, Aisha Ahmed Alarfaj, Muhammad Umer, Ebtisam Abdullah Alabdulqader, Shtwai Alsubai, Tai-hoon Kim, and Imran Ashraf. 2024. “An Automated Approach to Predict Diabetic Patients Using KNN Imputation and Effective Data Mining Techniques.” BMC Medical Research Methodology 24 (1): 221.
Boateng, Ernest Yeboah, Joseph Otoo, and Daniel A Abaye. 2020. “Basic Tenets of Classification Algorithms k-Nearest-Neighbor, Support Vector Machine, Random Forest and Neural Network: A Review.” Journal of Data Analysis and Information Processing 8 (4): 341–57.
Deng, Zhenyun, Xiaoshu Zhu, Debo Cheng, Ming Zong, and Shichao Zhang. 2016. “Efficient kNN Classification Algorithm for Big Data.” Neurocomputing 195: 143–48.
Esposito, Carmen, Gregory A Landrum, Nadine Schneider, Nikolaus Stiefl, and Sereina Riniker. 2021. “GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning.” Journal of Chemical Information and Modeling 61 (6): 2623–40.
Iparraguirre-Villanueva, Orlando, Karina Espinola-Linares, Rosalynn Ornella Flores Castañeda, and Michael Cabanillas-Carbonell. 2023. “Application of Machine Learning Models for Early Detection and Accurate Classification of Type 2 Diabetes.” Diagnostics 13 (14): 2383.
Kataria, Aman, and MD Singh. 2013. “A Review of Data Classification Using k-Nearest Neighbour Algorithm.” International Journal of Emerging Technology and Advanced Engineering 3 (6): 354–60.
Khateeb, Nida, and Muhammad Usman. 2017. “Efficient Heart Disease Prediction System Using k-Nearest Neighbor Classification Technique.” In Proceedings of the International Conference on Big Data and Internet of Thing, 21–26.
Mucherino, Antonio, Petraq J Papajorgji, Panos M Pardalos, Antonio Mucherino, Petraq J Papajorgji, and Panos M Pardalos. 2009. “K-Nearest Neighbor Classification.” Data Mining in Agriculture, 83–106.
Panwar, Madhuri, Amit Acharyya, Rishad A Shafik, and Dwaipayan Biswas. 2016. “K-Nearest Neighbor Based Methodology for Accurate Diagnosis of Diabetes Mellitus.” In 2016 Sixth International Symposium on Embedded Computing and System Design (ISED), 132–36. IEEE.
Saxena, Krati, Zubair Khan, and Shefali Singh. 2014. “Diagnosis of Diabetes Mellitus Using k Nearest Neighbor Algorithm.” International Journal of Computer Science Trends and Technology (IJCST) 2 (4): 36–43.
Suriya, S, and J Joanish Muthu. 2023. “Type 2 Diabetes Prediction Using k-Nearest Neighbor Algorithm.” Journal of Trends in Computer Science and Smart Technology 5 (2): 190–205.
Syriopoulos, Panos K, Nektarios G Kalampalikis, Sotiris B Kotsiantis, and Michael N Vrahatis. 2023. “K NN Classification: A Review.” Annals of Mathematics and Artificial Intelligence, 1–33.
Theerthagiri, Prasannavenkatesan, A Usha Ruby, and J Vidya. 2022. “Diagnosis and Classification of the Diabetes Using Machine Learning Algorithms.” SN Computer Science 4 (1): 72.
Uddin, Shahadat, Ibtisham Haque, Haohui Lu, Mohammad Ali Moni, and Ergun Gide. 2022. “Comparative Performance Analysis of k-Nearest Neighbour (KNN) Algorithm and Its Different Variants for Disease Prediction.” Scientific Reports 12 (1): 6256.
Zhang, Shichao, Xuelong Li, Ming Zong, Xiaofeng Zhu, and Ruili Wang. 2017. “Efficient kNN Classification with Different Numbers of Nearest Neighbors.” IEEE Transactions on Neural Networks and Learning Systems 29 (5): 1774–85.
Zhang, Zhongheng. 2016. “Introduction to Machine Learning: K-Nearest Neighbors.” Annals of Translational Medicine 4 (11).