Assistant Professor, University of Florida, United States
Introduction: Bacterial vaginosis (BV) is a common vaginal syndrome affecting reproductive-age women globally. It is associated with adverse obstetric and gynecological outcomes, including increased risk of sexually transmitted infections, HIV, cervical cancer, and preterm birth. Diagnosis traditionally relies on Amsel's criteria or the Nugent score, but variations in vaginal microbiome composition by race and ethnicity challenge these approaches. Black women, for example, tend to have a more diverse vaginal microbiome even when healthy, raising questions about diagnostic accuracy and therapeutic approaches across ethnic groups. Sequencing technologies combined with machine learning (ML) offer new avenues for predictive models, but disparities in diagnosis persist. AI and ML in healthcare can perpetuate these disparities through biased data, discriminatory algorithms, and inadequate evaluation practices. Here we perform a rigorous comparison of machine learning architectures and feature selection methods used to train models on 16S rRNA data from patients presenting with BV. We test how feature selection affects model performance and identify the features considered significant by the best selection-model combination.
Materials and Methods: The data used in this study were originally produced by Ravel et al. and contain 16S rRNA sequencing counts, Nugent scores, and ethnic group labels for 396 women with and without bacterial vaginosis. The 16S rRNA counts were divided by 100 to normalize them. Patients with Nugent scores of 7 or higher were labeled BV positive; all others were labeled BV negative. To evaluate each model’s performance across ethnic groups, and to determine the best-performing model for each group, models were trained and tested on either the entire dataset or subsets containing only patients from a specific ethnic group. We studied four machine learning architectures and five statistical feature selection tests. The architectures are logistic regression (LR), random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP). Hyperparameters were tuned using the GridSearchCV function with 5-fold cross-validation. The feature selection methods are the ANOVA F-test (F Test), two-sided t-test (T-Test), point-biserial correlation (PB Corr), point-biserial significance (PB Sig), and Gini impurity (Gini). Models were trained and assessed using five datasets, four architectures, and six feature selection conditions (the five methods above plus, as the model count implies, a baseline without feature selection). For statistical analysis, n = 10 replicates were produced for each model by varying the random state from 0 to 9. In total, 1,200 models were trained and evaluated, 240 for each dataset. Models were assessed using balanced accuracy.
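The experimental design above can be sketched in a few lines of standard-library Python. This is an illustrative outline rather than the study's actual code: the function names (`label_bv`, `normalize`, `balanced_accuracy`) are hypothetical, the `"None"` entry in `SELECTORS` assumes that the sixth feature selection condition is a no-selection baseline, and balanced accuracy is computed here as the mean of sensitivity and specificity, which is the usual definition for a binary task. Model training and hyperparameter search are deliberately left out.

```python
from itertools import product

# Design factors named in the text.
DATASETS = ["Total", "White", "Black", "Asian", "Hispanic"]
ARCHITECTURES = ["LR", "RF", "SVM", "MLP"]
# Five tests from the text; "None" is an ASSUMED no-selection baseline.
SELECTORS = ["None", "F Test", "T-Test", "PB Corr", "PB Sig", "Gini"]
RANDOM_STATES = range(10)  # replicates: random state 0 through 9


def label_bv(nugent_score):
    """Return 1 (BV positive) for a Nugent score of 7 or more, else 0."""
    return 1 if nugent_score >= 7 else 0


def normalize(counts):
    """Divide raw 16S rRNA counts by 100, as described in the text."""
    return [c / 100 for c in counts]


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    positives = sum(1 for t, _ in pairs if t == 1)
    negatives = len(pairs) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in y_true")
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)


# Every (dataset, architecture, selector, random state) combination is one model.
runs = list(product(DATASETS, ARCHITECTURES, SELECTORS, RANDOM_STATES))
runs_per_dataset = len(runs) // len(DATASETS)

if __name__ == "__main__":
    print(len(runs), runs_per_dataset)
    print(balanced_accuracy([1, 1, 0, 0], [1, 0, 0, 0]))
```

Enumerating the full factorial grid this way makes the model count easy to check against the totals reported in the text, and balanced accuracy is a natural metric here because the BV-positive and BV-negative classes need not be equally represented within each ethnic group.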
Results, Conclusions, and Discussions: In general, boxplots for the Total, White, and Black datasets are centered around a balanced accuracy of 0.9 with a moderate spread. Models trained on the Asian and Hispanic datasets generally performed worse. The Hispanic dataset models are centered around 0.8 with smaller spreads. Models trained on the Asian dataset were the most variable, centering around 0.75 with a larger spread.
The impact of feature selection on these models varies greatly by dataset and architecture. For most models trained on the White dataset, feature selection improves performance little, and in some cases has a negative impact. For the Black dataset, each model architecture is affected differently by each feature selection method. For the Asian dataset, the mean balanced accuracies of the LR, RF, and MLP models are increased by feature selection. The models trained on the Hispanic dataset show little improvement from feature selection. The best model-feature selection combinations are LR with PB Sig for the Total set, SVM with T-Test for the White set, RF with PB Sig for the Black set, LR with PB Sig for the Asian set, and SVM with PB Sig for the Hispanic set. These models achieved balanced accuracies of 0.91, 0.96, 0.94, 0.85, and 0.83, respectively.
Feature selection also allows us to identify groups of bacteria that may be important in the presentation of BV across different populations. The abundances of Prevotella, Megasphaera, Atopobium, Eggerthella, Dialister, Ruminococcaceae, and Sneathia are the most important for classifying BV. Peptoniphilus and Lactobacillales form a two-member cluster that is also important for the Black dataset but less so for the other datasets. Similarly, Anaeroglobus, Aerococcus, Prevotellaceae, and Lachnospiraceae form a cluster that is important only for classifying BV in the White dataset. Parvimonas is also important for the White dataset, but is clustered separately. For the Total dataset, a cluster containing Coriobacteriaceae, L. crispatus, Prevotellaceae, Bacteroidetes, Arcanobacterium, Lachnospiraceae, Bulleidia, and Mobiluncus is identified as important.
This study uses a rigorous comparison of methods to optimize the performance of ML models and reduce the population-based disparities seen when using a one-size-fits-all approach to classifier development. It also reveals important bacteria for further BV study.