Synthetic data can recapitulate statistical properties and capture the complexity of clinical and genomic features in hematologic malignancies, according to a study published in Blood.1 Synthetic datasets were found to reproduce survival estimates reliably and to provide effective data augmentation. It is hoped that the quick adoption of synthetic datasets may help accelerate precision medicine efforts in hematology.
Clinicians have been eager to see real-world, comprehensive clinical and genomic data for hematologic malignancies. Such data could help improve diagnosis, prognosis, and personalized treatment options. The study demonstrates that synthetic data can capture the complexities of an original dataset.
D’Amico and colleagues compared the quality of synthetic data to the first publication on clinical relevance of gene mutations in myelodysplastic syndrome,2 as well as a large cohort with molecular classifications3 and prognostic scores.4
The team generated a 300% augmented synthetic dataset from the myelodysplastic syndrome (MDS) cohort available in 2014 (n=944).2 Patients were stratified by Hierarchical Dirichlet (HD) clustering, which identified the same 8 subgroups described years later in a real cohort of 2043 patients.
Survival was analyzed with Kaplan-Meier curves, and a Cox proportional hazards (CoxPH) model was applied to the synthetic dataset to generate a molecular prognostic score (IPSS-M_Syn). The model was based on molecular features similar to those of the real IPSS-M and identified 6 risk categories in which the probability of survival was similar to that of the IPSS-M risk groups.
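For readers unfamiliar with the Kaplan-Meier method mentioned above, the following is a minimal sketch of the standard estimator in plain Python. It is not the study's code: the function name `kaplan_meier` and the toy follow-up times are hypothetical, chosen only to illustrate how a survival curve is built from follow-up durations and event indicators (1 = death observed, 0 = censored).

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient (e.g. months)
    events:    1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)      # everyone leaving the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= exits[t]
    return curve


# Hypothetical toy cohort (illustrative values only, not study data).
toy_months = [6, 12, 12, 20, 30]
toy_events = [1, 1, 0, 1, 0]

for time, prob in kaplan_meier(toy_months, toy_events):
    print(f"month {time}: estimated survival {prob:.2f}")
```

Running the same function on a real cohort and on its synthetic counterpart, then comparing the two curves, is one simple way to picture the kind of survival comparison the article describes. In practice a dedicated survival analysis library would normally be used instead of hand-written code.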
Using different cohorts of patients with MDS and acute myeloid leukemia (AML), the team created a Synthetic Validation Framework to evaluate the quality of the generated synthetic data. The framework is built around two scores, Clinical Synthetic Fitness (CSF) and Genomic Synthetic Fitness (GSF), which the team calculated from equations they defined.
After developing a synthetic copy of an MDS cohort (n=2043), they compared synthetic to real data, obtaining high fitness performance for both clinical and genomic features (CSF=93%; GSF=90%). When a conventional scoring system (IPSS-R) was applied, the probability of survival was comparable between synthetic and actual patients.
The team also analyzed synthetic MDS datasets with a model trained on a real dataset and generated a synthetic augmented dataset (200%), which demonstrated high fitness performance for both clinical and genomic features (CSF=91%; GSF=89%).
A similar trend was found in a cohort of 1002 patients with AML (CSF=92%; GSF=89%), “proving evidence for high generalizability of the model across different clinical settings,” the authors noted in their report.