Disease Prediction with Graph Machine Learning
Graph ML

Leveraging network science to enrich symptom-based disease prediction

Graph metrics surface the structure behind symptoms, enabling leaner-yet-accurate diagnoses.

Project Information

Course: Financial Data Science
Authors: Andrea Alberti, Davide Ligari, Cristian Andreoli, Matteo Scardovi
Date: December 2023
Pages: 23

Technologies

Python, NetworkX, Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn

Abstract

This work maps 773 diseases and 377 symptoms into a bipartite network to engineer graph-aware features for diagnosis. The Method of Reflections, Disease/Symptom Influence indices, community detection and betweenness centrality yield new descriptors that complement one-hot symptoms. Logistic Regression, Random Forest and MLP models were benchmarked; the best logistic model matches the symptom-only baseline while using fewer inputs, and a per-class breakdown shows which diseases it predicts well or poorly.
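
As a rough illustration of the network step, the sketch below builds a small disease/symptom bipartite graph with NetworkX and derives two of the descriptors mentioned above. The toy links, the helper name build_bipartite and the use of greedy modularity communities are assumptions made for illustration; the project describes its communities only as Louvain-like, so this is a stand-in rather than the authors' procedure.

```python
import networkx as nx
from networkx.algorithms import community

# Toy disease -> symptoms links (hypothetical, not the project's data).
toy_links = {
    "flu": ["fever", "cough"],
    "cold": ["cough", "sneezing"],
    "asthma": ["cough", "wheezing"],
}


def build_bipartite(disease_to_symptoms):
    """Return a graph with diseases on side 0 and symptoms on side 1."""
    graph = nx.Graph()
    for disease, symptoms in disease_to_symptoms.items():
        graph.add_node(disease, bipartite=0)
        for symptom in symptoms:
            graph.add_node(symptom, bipartite=1)
            graph.add_edge(disease, symptom)
    return graph


graph = build_bipartite(toy_links)

# Betweenness centrality: share of shortest paths that pass through a node.
betweenness = nx.betweenness_centrality(graph)

# Community IDs via greedy modularity maximisation (a stand-in assumption).
communities = community.greedy_modularity_communities(graph)
community_id = {
    node: idx for idx, members in enumerate(communities) for node in members
}

for node in sorted(graph.nodes):
    print(node, round(betweenness[node], 3), community_id[node])
```

In this toy graph every path between two diseases runs through the shared symptom "cough", so it receives the highest betweenness score.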

About

The project stages a complete pipeline. It builds a bipartite graph from 246k synthetic medical cases, iterates the Method of Reflections to derive Symptom Influence (SI) and Disease Influence (DI) scores, extracts Louvain-like communities, computes betweenness centrality and compares the network against null models. These graph descriptors are fused with one-hot symptoms and filtered through forward stepwise selection to curb dimensionality. Logistic Regression, Random Forest and MLP models are then tuned via grid search; the chosen logistic model with network features matches the accuracy of a symptom-only baseline while keeping overfitting low. A post-hoc analysis breaks down disease-wise accuracy, highlights confusion between pathologies with overlapping symptoms (e.g., bladder cancer vs. diabetes insipidus) and studies how feature truncation affects accuracy and training time.
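
The Method of Reflections step can be sketched with NumPy as below, using the standard recursion in which each node's score at iteration n is the average of its neighbours' scores at iteration n-1. The function name, the toy matrix and the iteration count are assumptions; how the project converts these iterates into its SI and DI indices is not stated here, so the sketch only returns the raw values.

```python
import numpy as np


def method_of_reflections(m, n_iter=2):
    """Iterate the Method of Reflections on a binary disease x symptom matrix.

    Returns two lists of arrays, k_d[n] and k_s[n], for n = 0..n_iter.
    """
    m = np.asarray(m, dtype=float)
    k_d0 = m.sum(axis=1)  # degree of each disease (number of symptoms)
    k_s0 = m.sum(axis=0)  # degree of each symptom (number of diseases)
    if (k_d0 == 0).any() or (k_s0 == 0).any():
        raise ValueError("every disease and symptom needs at least one link")

    k_d, k_s = [k_d0], [k_s0]
    for _ in range(n_iter):
        # Average the previous-step scores of the opposite side's neighbours.
        next_d = (m @ k_s[-1]) / k_d0
        next_s = (m.T @ k_d[-1]) / k_s0
        k_d.append(next_d)
        k_s.append(next_s)
    return k_d, k_s


# Toy 3 diseases x 3 symptoms adjacency matrix (hypothetical).
toy_m = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
])
k_d, k_s = method_of_reflections(toy_m, n_iter=2)
print("disease scores, iteration 1:", k_d[1])
print("symptom scores, iteration 1:", k_s[1])
```

Iteration 0 is plain degree; iteration 1 gives, for each disease, the average ubiquity of its symptoms, and for each symptom, the average number of symptoms of the diseases it appears in.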

Key Results

Best Model: Logistic Regression
Feature Reduction: 28%
Accuracy Drop: 1.5%
Training Time Reduction: 9.4%

Key Findings

  • A 246k-sample synthetic dataset was modelled as a symptom–disease bipartite graph, generating Method of Reflections SI/DI scores, community IDs and betweenness metrics to augment one-hot symptoms.
  • Forward stepwise selection identified the most informative mix of graph and symptom features, keeping the logistic baseline accuracy while reducing dimensionality and training cost.
  • Per-disease diagnostics exposed classes with perfect recall (mitral valve disease, acute bronchospasm) and showed that bladder cancer remains difficult because its symptom profile overlaps with those of other diseases.
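
A per-disease breakdown of the kind described in the last finding can be sketched as follows. The helper names and the toy labels are made up for illustration and are not the project's predictions; the sketch computes recall per true class and, for a given class, the label it is most often mistaken for.

```python
from collections import Counter

import numpy as np


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        label.item(): float(np.mean(y_pred[y_true == label] == label))
        for label in np.unique(y_true)
    }


def most_confused_with(y_true, y_pred, label):
    """Most frequent wrong prediction for `label`, or None if there is none."""
    wrong = [p for t, p in zip(y_true, y_pred) if t == label and p != label]
    return Counter(wrong).most_common(1)[0][0] if wrong else None


# Toy labels (hypothetical): disease_b is often mistaken for disease_c.
y_true = ["disease_a"] * 3 + ["disease_b"] * 4 + ["disease_c"] * 3
y_pred = (
    ["disease_a"] * 3
    + ["disease_b", "disease_b", "disease_c", "disease_c"]
    + ["disease_c"] * 3
)

recall = per_class_recall(y_true, y_pred)
print(recall)
print(most_confused_with(y_true, y_pred, "disease_b"))
```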

Methodology

Built a bipartite graph linking 773 diseases to 377 symptoms from 246k cases, derived Method of Reflections SI/DI metrics, community and betweenness features, fused them with one-hot symptoms, then applied forward stepwise selection before benchmarking Logistic Regression, Random Forest and MLP classifiers.
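
The forward stepwise selection step could look roughly like the sketch below, which uses scikit-learn's SequentialFeatureSelector around a logistic regression. The synthetic data, the number of features to keep and the cross-validation setting are all assumptions chosen to keep the example small; the project's actual selection criterion and stopping rule are not given here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the fused graph + one-hot symptom feature matrix.
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Greedily add one feature at a time, keeping the one that most improves
# cross-validated accuracy, until three features are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    cv=3,
)
selector.fit(X, y)

selected = np.flatnonzero(selector.get_support())
X_reduced = selector.transform(X)
print("selected feature indices:", selected)
print("reduced shape:", X_reduced.shape)
```

The reduced matrix can then be passed to any of the benchmarked classifiers in place of the full feature set.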