6.793/HST.956 2024 Potential Class Projects

In our lecture on February 22, you heard Drs. Celi and Dale describe potential projects for our students to work on this semester. In addition, Dr. Jason Qian, a urological surgeon at Brigham & Women's (MGB), suggested additional possible projects for which there are publicly available data. Below is a listing of all these, and we will have a preference list for you to fill out indicating which ones are of interest to you. Remember that if you have an interesting problem and data set, you may also propose that, but you should include other students from the class.

From Dr. Leo Celi

For the projects proposed by Dr. Celi, the contact people who will direct you to the right teams are Jack Gallifant (jgally@mit.edu) and Cecile Chavane (chavane@mit.edu).

Project 1: Labrador: a novel transformer for numerical data

"Labrador" is an exciting project inspired by work published in Nature Medicine in December 2023 [1]. This project centers on advancing the development of a transformer specifically designed for numerical data, building upon the foundation laid by prior research utilizing laboratory values from the MIMIC database. The original work demonstrated promise with its open-source transformer model, although its advantages over traditional methods like XGBoost were somewhat limited. The goal of "Labrador" is to significantly enhance this model, focusing on creating better representations of laboratory data relationships. This enhancement will be particularly valuable for tasks such as the imputation of missing values and various prediction tasks in healthcare data analysis. Students involved in this project will have the opportunity to contribute to a vital area of medical AI research, working in partnership with industry and academic leaders to refine and expand the capabilities of AI in interpreting complex healthcare datasets. This project offers an educational experience in cutting-edge technology and the potential to make a tangible impact in the field of healthcare data analysis.
[1] Bellamy DR, Kumar B, Wang C, Beam A. Labrador: Exploring the Limits of Masked Language Modeling for Laboratory Data [Internet]. arXiv; 2023 [cited 2023 Nov 29]. Available from: https://arxiv.org/abs/2312.11502

Project 2: Vector embeddings as a cost-efficient & privacy-preserving alternative

This project focuses on leveraging vector embeddings to efficiently share deidentified imaging data, specifically targeting the medical field in Low- and Middle-Income Countries (LMICs) like Uganda. Centered around the analysis of retinal images, it aims to revolutionize how medical data is handled in resource-limited settings. Collaborating with multiple medical centers across Uganda, the team has collected thousands of retinal images, primarily for diabetic staging purposes. The project's crux involves applying advanced vector embedding techniques to convert these images into a compressed, information-rich format. This format is designed to retain essential diagnostic features and significantly reduce the computational resources required for processing and analysis. The project will rigorously evaluate this approach's computational efficiency and cost-effectiveness, a crucial aspect in the context of LMICs where such resources are often scarce. Additionally, the embedded data will be analyzed in conjunction with diabetic staging labels already linked to the images to assess the clinical utility and accuracy of these embeddings in medical diagnosis. The ultimate goal of this initiative is to demonstrate the practicality and advantages of vector embeddings in deidentified imaging data sharing, potentially leading to enhanced medical image analysis capabilities and improved healthcare outcomes in LMICs. This project represents an exciting intersection of advanced machine-learning techniques and global health, promising significant contributions to both fields.
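
As a rough illustration of the storage trade-off described above, the following minimal sketch compresses image arrays into low-dimensional vectors and reports the resulting size reduction. It uses PCA (computed with NumPy's SVD) purely as a stand-in for a learned image encoder, and it runs on random synthetic arrays. The image size, embedding dimension, and function names are assumptions made for illustration; they are not the project's actual pipeline or data.

```python
import numpy as np

def fit_pca(images: np.ndarray, dim: int):
    """Fit a PCA projection on flattened images; returns (mean, components)."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    mean = flat.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:dim]

def embed(images: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project flattened images onto the principal components."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    return ((flat - mean) @ components.T).astype(np.float32)

def compression_ratio(images: np.ndarray, embeddings: np.ndarray) -> float:
    """Raw image bytes divided by embedding bytes."""
    return images.nbytes / embeddings.nbytes

rng = np.random.default_rng(0)
# Hypothetical stand-in for retinal images: 200 grayscale 64x64 uint8 arrays.
images = rng.integers(0, 256, size=(200, 64, 64), dtype=np.uint8)
mean, components = fit_pca(images, dim=32)
embeddings = embed(images, mean, components)
print(embeddings.shape, compression_ratio(images, embeddings))
```

In a real evaluation, a pretrained or fine-tuned encoder would likely replace PCA, and the key check would be how well the diabetic staging labels can be predicted from the embeddings alone.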

Project 3: Satellite Imagery in Computer Science: Harnessing Real-time Data for Global Impact

The project focuses on harnessing satellite data from sources like SentinelHub and Google Earth Engine to overcome challenges such as noise and cloud interference in climate change impact predictions. Our project spans various applications, each critical to understanding and mitigating climate change impacts. We delve into monitoring deforestation, a key factor in global climate dynamics, and explore the use of satellite imagery in forecasting educational indices, shedding light on the socio-environmental interplay. Assessing water quality through satellite data provides insights into ecological and human health, while the prediction of climatic events via advanced data analysis stands as a crucial tool in disaster preparedness and mitigation.

Emphasis is placed on evaluating advanced methods like dimensionality reduction, vector embeddings, and continual learning to enhance the applicability and accuracy of this data in real-world scenarios. Students will delve into various applications, from monitoring deforestation and forecasting educational indices to assessing water quality and predicting climatic events, addressing key research questions about the efficacy of satellite imagery in real-time monitoring and climate-related prediction.

Project 4: Security Evaluation of Language Models in Web Interfaces

Language models have become integral to web interfaces, providing users with an interactive and intuitive experience. However, integrating these models into web environments raises concerns about security and privacy, as interactions are susceptible to tracking mechanisms such as cookies and type monitoring. This has important implications in healthcare, where protected health information may be present in other browser tabs or stored in clipboards. In addition, if information contained in browser histories, cookies, or past conversations alters language model outputs, it could affect patient diagnoses and treatments. This project aims to investigate the security of web interfaces that utilize language models, focusing on identifying potential vulnerabilities and proposing mitigation strategies.

This project has many angles, and students can tackle different components of this project and collaborate toward the ultimate goal of developing a comprehensive understanding of the security landscape of language models in web interfaces and proposing solutions that enhance privacy and data protection. By participating in this project, students will gain valuable insights into the security challenges associated with language models in web interfaces and contribute to developing more secure and privacy-conscious technologies.

Project 5: Examine the impact of NLP bias transfer into multimodal tasks

While both imaging and NLP tasks are laden with bias, integrating different modalities into one system may create and compound new, less easily understood biases. Given the increasing potential of language models to be used in radiology to report clinical imaging studies, there is an urgent safety need to understand how biases transfer and transform between models and data modalities. This project aims to evaluate the impact of different pretraining mixtures for imaging classifiers and language models on downstream bias in model outputs in multimodal tasks. Students will leverage a new co-occurrence dataset mapping disease-demographic associations from 600 diseases in language model pretraining datasets (e.g., arXiv, Common Crawl, PubMed) and demographic information from MIMIC-CXR to create such mixtures. Students will explore the impact of the pretraining mixtures as well as evaluate the malleability of each modality to ‘debiasing’ through fine-tuning approaches.

Project 6: Evaluating the Impact of Weak Supervision Biases on Strong Model Generalization

As machine learning models, particularly in the healthcare domain, become increasingly sophisticated, surpassing human-level performance and comprehension, the necessity for effective supervision mechanisms grows. OpenAI's recent concept of weak-to-strong generalization, where smaller, less capable models supervise larger, more capable ones, offers a promising avenue [1]. However, this approach raises critical questions about biases inherent in weak supervision, particularly when using clinical outcomes and notes as labels, and their impact on the generalization capabilities of strong models. This project aims to investigate how biases in weak supervision affect the generalization performance of strong pre-trained models in the medical field. It focuses on understanding whether these models can generalize to solve complex problems with only weak, incomplete, or flawed training labels. Students will evaluate which weak-label errors impact generalization across subgroups of topics and how the structure of such biases impacts performance and alignment. By examining different types of weak label biases and their influence on the generalization capabilities of strong models, this project seeks to contribute to a deeper understanding of how biased labels impact machine learning, particularly in healthcare AI. It aims to shed light on the importance of errors in weak supervision and provide insights into developing more robust, unbiased AI systems in healthcare.
[1] Burns et al. (2023). Weak-to-strong generalization: eliciting strong capabilities with weak supervision. https://cdn.openai.com/papers/weak-to-strong-generalization.pdf

Project 7: Development of a Comprehensive Claim Graph for Evaluating Clinical Assertions in Scholarly Literature

The exponential growth of biomedical literature presents a significant challenge in accurately tracking and validating the assertions made within these publications. This project aims to create a novel approach for systematically analyzing and cataloging claims made in scholarly articles, focusing on clinical evidence and drug-related research. Students will develop a sophisticated methodology to scan academic papers, identify cited claims, and assess the veracity and origin of these assertions.

A proof of concept has been developed as part of a recent paper to invalidate the claim that AUPRC is better in settings of class imbalance [1]. This project will involve building on this pipeline and parsing through academic papers to extract citations and the associated claims linked to these references. The core of the work will be to verify claims in scientific literature related to clinical diagnosis and treatment. An essential component of this project is creating an unstructured database of 'claims', meticulously linking them to papers that endorse, reference, or assert these claims without proper attribution. This database will be a critical tool for researchers to trace the origin and evolution of specific clinical assertions and their underlying evidence.
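
To make the idea of linking claims to papers more concrete, here is a minimal sketch of one possible in-memory representation. The three relation types come from the description above; the class name, method names, and paper identifiers are hypothetical and do not reflect the existing proof-of-concept pipeline. The sketch also assumes each paper cites at most one source per claim, which a real system would need to relax.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Relation types taken from the project description; the names are illustrative.
RELATIONS = {"endorses", "references", "asserts_without_attribution"}

@dataclass
class ClaimGraph:
    # claim text -> list of (paper_id, relation, cited_paper_id or None)
    links: dict = field(default_factory=lambda: defaultdict(list))

    def add_link(self, claim, paper_id, relation, cites=None):
        """Record that a paper makes a claim, optionally citing another paper."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.links[claim].append((paper_id, relation, cites))

    def unattributed(self, claim):
        """Papers that state the claim without citing a source."""
        return [p for p, rel, _ in self.links[claim]
                if rel == "asserts_without_attribution"]

    def trace_origin(self, claim, paper_id):
        """Follow citations for a claim back to the root of the citation chain."""
        cited_by = {p: c for p, _, c in self.links[claim] if c is not None}
        seen = {paper_id}
        while paper_id in cited_by and cited_by[paper_id] not in seen:
            paper_id = cited_by[paper_id]
            seen.add(paper_id)
        return paper_id

# Hypothetical example: three placeholder papers repeating the same claim.
graph = ClaimGraph()
claim = "AUPRC is better than AUROC under class imbalance"
graph.add_link(claim, "paper_C", "references", cites="paper_B")
graph.add_link(claim, "paper_B", "references", cites="paper_A")
graph.add_link(claim, "paper_A", "asserts_without_attribution")
print(graph.trace_origin(claim, "paper_C"))
print(graph.unattributed(claim))
```

Tracing from `paper_C` follows the citations to `paper_A`, which is also the only paper flagged as asserting the claim without attribution.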

A key area of interest for this project is the exploration of clinical topics like COVID-19, as we would be able to evaluate the full lifecycle of knowledge generation from the emergence of the condition to the current day. This would then be compared with gold standard systematic reviews and national guidelines. However, students are encouraged to suggest and explore other clinical domains where the 'claim graph' approach could yield insightful results. The project also opens up the possibility of exploring sub-claims and how they contribute to broader assertions, allowing for a more nuanced understanding of the interplay of ideas within the scientific community. This project will not only contribute to the integrity and transparency of scientific reporting but will also provide a valuable resource for researchers, clinicians, and policymakers in making informed decisions based on the most reliable and accurately attributed evidence.
[1] Matthew B. A. McDermott, Lasse Hyldig Hansen, Haoran Zhang, Giovanni Angelotti, Jack Gallifant. A Closer Look at AUROC and AUPRC under Class Imbalance. https://arxiv.org/pdf/2401.06091.pdf

From Dr. Alexander Dale

These projects, from global startups selected and supported by MIT Solve, are not just research. The work the students do is likely to translate to real impact on real people, specifically people who are normally most underserved by health services.

The analysis completed by last year’s students with Solver team M’Care was presented to a local ministry of health in rural Nigeria, and led to expanded access to a nutrition program for underserved children.

To connect with specific projects, please get in touch with Noel Shaskan (Noel.Shaskan@solve.mit.edu).

Project 8: SAVE by Briota

Solution profile
Solution website

The Solution: An India-based team using a combination of hardware and software solutions to fill care gaps for rural and poor communities, where high costs and lack of healthcare specialists mean diagnosis of conditions can be incredibly difficult. Much of their work to date has been focused on Chronic Respiratory Diseases (COPD, ILD, Post TB, Long Covid, Asthma) but they’re expanding their work to include cardiometabolic diseases and common cancers such as lung, breast, and oral cancer. They have supported 55,000 patients to date.

The data: Briota has primary data from clinical breast examinations using a simple-to-use light-source device and OCR screening cards at the "screening" level. These screening cards list and classify potential breast cancer anomalies. Secondary data about mammography outcomes for such patients should already be available; we will also be preparing primary data based on our "screening camps".

The Project: We have designed a breast health score algorithm. We would like to build prediction models that use this score to predict whether mammography is required and to predict the mammography outcome.

The team:

Founder - Gajanan Sakhare, a computer science student with over 2 decades of experience in the digital world - games, education and healthcare.
Medical Advisor - Rohini Patil, a breast cancer crusader, gynecologist with over 3 decades of experience

Project 9: Climate Health Vulnerability Mapping by Khushi Baby

Solution profile
Solution website

The Solution: Khushi Baby is another India-based project. They’ve served over 50M people -- mainly poor, rural women and children -- by building clinics, creating data-driven solutions for Community Health Workers, and working closely with local government to track health and deliver care.

The data: Khushi Baby has health data anonymized and aggregated at the village census tract level for some 50+ indicators across reproductive and child health, non-communicable disease, and infectious diseases across Rajasthan for ~45M beneficiaries.

The Project: Khushi Baby is looking for support in this new effort to link their health data to climate data, so they can ultimately better predict outbreaks and other health impacts of climate change.

They are looking to assess statistical spatial associations and spatiotemporal trends between climatic data and these health outcome indicators. Climate data sets at the 5-10 km^2 resolution are being obtained from Google’s Environmental signal data. Specific associations of interest include land surface temperature, rainfall, vegetation, and UV against malnutrition, anemia, and vector-borne disease; and PM2.5 against TB, chronic respiratory disease, and cardiovascular disease. More information can be found here.

We hope the class can help us both in implementing a pipeline for multi weighted geographic regression and in extracting relevant features from randomized satellite data on similar resolutions.
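
For readers unfamiliar with this family of methods, the sketch below shows a basic geographically weighted regression (GWR): one weighted least-squares fit per location, with a Gaussian kernel that down-weights distant observations. This is an assumption about what the requested pipeline builds on. It uses a single fixed bandwidth rather than a multiscale variant, and it runs on synthetic data in which the effect of a made-up "temperature" predictor varies from west to east. All variable names and values are illustrative, not Khushi Baby's data.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Fit one weighted least-squares regression per location.

    coords: (n, 2) locations, X: (n, p) predictors, y: (n,) outcome.
    Returns an (n, p + 1) array of local coefficients (intercept first).
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    betas = np.empty((n, design.shape[1]))
    for i in range(n):
        # Gaussian kernel: nearby observations get more weight.
        dist = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)
        xtw = design.T * w  # equivalent to design.T @ diag(w)
        betas[i] = np.linalg.solve(xtw @ design, xtw @ y)
    return betas

rng = np.random.default_rng(0)
n = 300
coords = rng.uniform(0, 10, size=(n, 2))      # synthetic village locations
temperature = rng.normal(size=n)              # synthetic climate predictor
true_slope = 0.5 + 0.2 * coords[:, 0]         # effect grows from west to east
outcome = 1.0 + true_slope * temperature + rng.normal(scale=0.1, size=n)

betas = gwr_coefficients(coords, temperature[:, None], outcome, bandwidth=1.5)
# The local slopes should track the spatially varying true effect.
print(np.corrcoef(betas[:, 1], true_slope)[0, 1])
```

In practice the bandwidth would be chosen by cross-validation or an information criterion, and a dedicated library such as the PySAL `mgwr` package may be a better starting point than hand-written code.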

The Team: Sarfraz, Chief Data Scientist at Khushi Baby, has been with us for the last 7 years and is also a data science mentor with Great Learning at MIT. He is also mentoring master's students in a new climate and health track at BITS Pilani in collaboration with JPAL.

Project 10: Rology

Solution profile
Solution website

The Solution: Rology is an Egypt-based team that works to fill health system gaps in multiple countries in the Middle East and North Africa that lack qualified radiologists and other healthcare providers. Their AI-assisted platform matches each individual case with the most qualified available (remote) radiologist.

The Data: Rology possesses an extensive inventory of over 300 million interpreted DICOM images across various modalities, including CT, MRI, and X-ray, among others. This dataset is de-identified and anonymized, ensuring patient privacy while providing a valuable resource for research and development in medical imaging. Rology's data inventory, characterized by its diversity, regulatory-grade quality, and continuous growth, offers a unique opportunity for advancing AI-driven solutions in healthcare.

The Project: Rology seeks to delve into the efficiency, productivity, accuracy, and workflow improvements achievable with AI in medical imaging. They seek to answer key questions on how multimodal generative AI, LLMs, and vision language models (VLMs) can enhance the detection of anomalies and streamline reporting processes. They aim to quantify the impact of these technologies on identifying subtle diagnostic features, improving early detection rates, and ultimately, optimizing the radiology workflow to support better patient outcomes.

The Team:

CEO Amr Abodraiaa, noted for pioneering FDA-cleared AI-assisted diagnostics in teleradiology, bringing a wealth of leadership and innovation to the team;
Tech Lead Moemen Mohamed, with over 15 years of experience in medical imaging technologies, stands out for his adeptness in delivering complex solutions that enhance healthcare delivery;
Associate CEO Mahmoud Barakat, with a robust background in product management and a proven track record in driving AI advancements and strategic initiatives, offers critical insights into aligning technology with healthcare needs;
Tech Team Lead Sameh Ahmed, an assistant lecturer at Cairo University, brings an academic perspective to the practical application of AI in radiology, ensuring a bridge between theoretical knowledge and real-world application.

From Dr. Jason Qian

His suggestions are to examine interesting publicly available datasets and to define projects supported by those data. If you are interested in working with him, please contact him via email at zhiyu.qian.jason@gmail.com

Project 11: CDC publicly available datasets; possible focus on health disparities

These are public health/health behaviors focused, with limited details on specific disease/disease outcomes. These data are deidentified and covered by a generic IRB. We’ve done a series of projects looking at health disparities with these data, which could potentially be a match with Dr. Celi’s research interest.
- National Health Interview Survey (NHIS): interview surveys on various health behaviors: cancer screening, dental care, telehealth, asthma, cholesterol/blood pressure/aspirin etc.
- National Health and Nutrition Examination Survey (NHANES): another longitudinal survey with more granular data on lab values, nutrition intake, and a few diseases: dermatology, immunization, kidney conditions, diabetes, mental health etc.
- Behavioral Risk Factor Surveillance System (BRFSS): the largest CDC longitudinal survey looking at health behaviors: smoking, mental health, cardiovascular health, preventive health etc. There are also specific modules measuring social determinants of health.
  - Data/dictionary: https://www.cdc.gov/brfss/about/index.htm

Project 12: National cancer database (NCDB)

Data from ~1500 nationally accredited cancer centers. Contains very well coded information on different types of cancers (e.g., demographics, stage/grade at diagnosis, treatment received, outcomes). This dataset does not have lab values/EKG/vitals like MIMIC but is still a very robust dataset. I have the latest data on my computer for prostate and kidney cancer and am actively looking at treatment trends with it. These are de-identified data covered by a generic IRB. This data is available via application to all participating cancer centers: e.g., Dana-Farber/Brigham and Women’s, Beth Israel, Mass General etc.

Data/dictionary: https://www.facs.org/quality-programs/cancer-programs/national-cancer-database/puf/
These are 2 examples of what Dr. Madhur Nyan did with this data last year:
- https://pubmed.ncbi.nlm.nih.gov/34465541/
- https://pubmed.ncbi.nlm.nih.gov/34529282/

Project 13: Longitudinal Cancer Data (SEER)

Surveillance, Epidemiology, and End Results Program (SEER): longitudinal cancer database maintained by NIH. It is publicly available but requires an application process (instead of being directly downloadable like the CDC ones). It is similar to NCDB but is based on a different population. I have yet to do a project on it myself but have access to it.

Data/dictionary: https://seer.cancer.gov/data-software/