
Leveraging network science to enrich symptom-based disease prediction
Graph metrics surface the structure behind symptoms, enabling leaner-yet-accurate diagnoses.
Mapped 773 diseases and 377 symptoms into a bipartite network to engineer graph-aware features for diagnosis. The Method of Reflections, Disease/Symptom Influence indices, community detection and betweenness centrality yield new descriptors that complement one-hot symptom encodings. Logistic Regression, Random Forest and MLP models were benchmarked; the best logistic model matches the accuracy of the symptom-only baseline while using fewer inputs, and a per-disease breakdown shows how accuracy varies across classes.
The project implements a complete pipeline: build a bipartite graph from 246k synthetic medical cases; run Method of Reflections iterations to derive Symptom Influence (SI) and Disease Influence (DI) scores; detect Louvain-like communities; and compute betweenness centrality, with null-model comparisons as a reference. These graph descriptors are combined with one-hot symptom encodings and filtered through forward stepwise selection to curb dimensionality. Logistic Regression, Random Forest and MLP models are then tuned via grid search; the selected logistic model with network features matches the accuracy of a symptom-only baseline while keeping overfitting low. A post-hoc analysis breaks down accuracy by disease, highlights confusion between overlapping pathologies (e.g., bladder cancer vs. diabetes insipidus) and examines how feature truncation affects accuracy and training time.
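The Method of Reflections step can be illustrated with a minimal pure-Python sketch. It assumes the standard formulation of the method, in which a node's order-0 score is its degree and each further iteration replaces the score with the average of its neighbours' previous scores on the other side of the bipartite graph. The function name `method_of_reflections`, the `edges` input format and the toy disease and symptom names are hypothetical, and the project's exact mapping from these iterations to its SI and DI indices is not shown here.

```python
def method_of_reflections(edges, n_iter=2):
    """Iterate neighbour-averaged scores on a disease-symptom bipartite graph.

    edges: iterable of (disease, symptom) pairs (hypothetical input format).
    Returns a list where entry N is (disease_scores, symptom_scores) at order N.
    """
    d_nbrs, s_nbrs = {}, {}
    for disease, symptom in edges:
        d_nbrs.setdefault(disease, set()).add(symptom)
        s_nbrs.setdefault(symptom, set()).add(disease)

    # Order 0: plain degree of each node.
    k_d = {d: float(len(nbrs)) for d, nbrs in d_nbrs.items()}
    k_s = {s: float(len(nbrs)) for s, nbrs in s_nbrs.items()}
    history = [(k_d, k_s)]

    for _ in range(n_iter):
        # Order N: average of the neighbours' order N-1 scores.
        new_d = {d: sum(k_s[s] for s in nbrs) / len(nbrs)
                 for d, nbrs in d_nbrs.items()}
        new_s = {s: sum(k_d[d] for d in nbrs) / len(nbrs)
                 for s, nbrs in s_nbrs.items()}
        k_d, k_s = new_d, new_s
        history.append((k_d, k_s))
    return history


# Toy example with made-up links, not taken from the project's data.
toy_edges = [
    ("flu", "fever"), ("flu", "cough"), ("flu", "fatigue"),
    ("cold", "cough"), ("malaria", "fever"),
]
history = method_of_reflections(toy_edges, n_iter=2)
disease_scores, symptom_scores = history[1]
print(disease_scores)
print(symptom_scores)
```

In this toy graph, the order-1 score of a symptom is the average number of symptoms shown by the diseases that present it, so a symptom that appears only in symptom-rich diseases scores high. Scores of this kind could serve as raw material for influence-style features, under the assumptions stated above.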