In our lecture on February 22, you heard Drs. Celi and Dale
describe potential projects for our students to work on this
semester. In addition, Dr. Jason Qian, a urological surgeon at
Brigham & Women's (MGB), suggested additional possible projects
for which there are publicly available data. Below is a listing of
all these, and we will have a preference list for you to fill out
indicating which ones are of interest to you. Remember that if you
have an interesting problem and data set, you may also propose
that, but your proposal should include other students from the class.
"Labrador" is an exciting project inspired by work published in
Nature Medicine in December 2023 [1]. This project centers on
advancing the development of a transformer specifically designed
for numerical data, building upon the foundation laid by prior
research utilizing laboratory values from the MIMIC database. The
original work demonstrated promise with its open-source
transformer model, although its advantages over traditional
methods like XGBoost were somewhat limited. The goal of "Labrador"
is to significantly enhance this model, focusing on creating
better representations of laboratory data relationships. This
enhancement will be particularly valuable for tasks such as the
imputation of missing values and various prediction tasks in
healthcare data analysis. Students involved in this project will
have the opportunity to contribute to a vital area of medical AI
research, working in partnership with industry and academic
leaders to refine and expand the capabilities of AI in
interpreting complex healthcare datasets. This project offers an
educational experience in cutting-edge technology and the
potential to make a tangible impact in the field of healthcare
data analysis.
[1] Bellamy DR, Kumar B, Wang C, Beam A. Labrador: Exploring the
Limits of Masked Language Modeling for Laboratory Data
[Internet]. arXiv; 2023 [cited 2023 Nov 29]. Available from: https://arxiv.org/abs/2312.11502
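One way to make the imputation task concrete is to hide a fraction of the observed laboratory values, fill them back in, and score the result on the hidden entries only. The sketch below does this with a simple per-lab median baseline rather than a transformer. The lab names, the synthetic values, and the 30% masking rate are illustrative assumptions and are not taken from the Labrador paper.

```python
import random
import statistics


def mask_values(records, mask_rate, rng):
    """Hide a random subset of observed values.

    Returns the masked copy and a dict of the hidden ground truth,
    keyed by (record index, lab name).
    """
    masked, hidden = [], {}
    for i, rec in enumerate(records):
        new_rec = dict(rec)
        for lab, value in rec.items():
            if value is not None and rng.random() < mask_rate:
                hidden[(i, lab)] = value
                new_rec[lab] = None
        masked.append(new_rec)
    return masked, hidden


def median_impute(records):
    """Fill each missing entry with the median of that lab's observed values."""
    labs = sorted({lab for rec in records for lab in rec})
    medians = {}
    for lab in labs:
        observed = [rec[lab] for rec in records if rec.get(lab) is not None]
        medians[lab] = statistics.median(observed) if observed else None
    return [
        {lab: rec[lab] if rec.get(lab) is not None else medians[lab] for lab in labs}
        for rec in records
    ]


def mean_absolute_error(imputed, hidden):
    """Score the imputation on the hidden entries only."""
    errors = [
        abs(imputed[i][lab] - truth)
        for (i, lab), truth in hidden.items()
        if imputed[i][lab] is not None
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Synthetic stand-in for lab data (hypothetical distributions, not MIMIC).
rng = random.Random(0)
records = [
    {
        "sodium": rng.gauss(140, 3),
        "potassium": rng.gauss(4.0, 0.4),
        "creatinine": rng.gauss(1.0, 0.2),
    }
    for _ in range(200)
]
masked, hidden = mask_values(records, mask_rate=0.3, rng=rng)
imputed = median_impute(masked)
baseline_mae = mean_absolute_error(imputed, hidden)
```

A learned model would replace `median_impute`; the masking and scoring steps stay the same, which gives a like-for-like comparison against the baseline.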
This project focuses on leveraging vector embeddings to
efficiently share deidentified imaging data, specifically
targeting the medical field in Low- and Middle-Income Countries
(LMICs) like Uganda. Centered on the analysis of retinal images,
it aims to revolutionize how medical data is handled
in resource-limited settings. Collaborating with multiple medical
centers across Uganda, the team has collected thousands of retinal
images, primarily for diabetic staging purposes. The project's
crux involves applying advanced vector embedding techniques to
convert these images into a compressed, information-rich format.
This format is designed to retain essential diagnostic features
and significantly reduce the computational resources required for
processing and analysis. The project will rigorously evaluate this
approach's computational efficiency and cost-effectiveness, a
crucial aspect in the context of LMICs where such resources are
often scarce. Additionally, the embedded data will be analyzed in
conjunction with diabetic staging labels already linked to the
images to assess the clinical utility and accuracy of these
embeddings in medical diagnosis. The ultimate goal of this
initiative is to demonstrate the practicality and advantages of
vector embeddings in deidentified imaging data sharing,
potentially leading to enhanced medical image analysis
capabilities and improved healthcare outcomes in LMICs. This
project represents an exciting intersection of advanced
machine-learning techniques and global health, promising
significant contributions to both fields.
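To give a rough sense of the storage savings an embedding can offer, the sketch below compares the size of an uncompressed image with the size of a fixed-length embedding vector, and shows cosine similarity as one common way to compare two embeddings. The image dimensions (512 x 512 RGB, 8 bits per channel) and the 512-dimensional float32 embedding are assumptions chosen for illustration. Images are usually stored compressed, so the practical ratio would be smaller.

```python
import math


def raw_image_bytes(height, width, channels, bytes_per_value=1):
    """Size of an uncompressed image in bytes."""
    return height * width * channels * bytes_per_value


def embedding_bytes(dim, bytes_per_value=4):
    """Size of an embedding vector in bytes (float32 by default)."""
    return dim * bytes_per_value


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


image_size = raw_image_bytes(512, 512, 3)  # 786,432 bytes
vector_size = embedding_bytes(512)         # 2,048 bytes
ratio = image_size / vector_size           # 384.0
```

Whether an embedding this small still retains the features needed for diabetic staging is exactly the empirical question the project sets out to evaluate.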
The project focuses on harnessing satellite data from sources
like SentinelHub and Google Earth Engine to overcome challenges
such as noise and cloud interference in climate change impact
predictions. Our project spans various applications, each critical
to understanding and mitigating climate change impacts. We delve
into monitoring deforestation, a key factor in global climate
dynamics, and explore the use of satellite imagery in forecasting
educational indices, shedding light on the socio-environmental
interplay. Assessing water quality through satellite data provides
insights into ecological and human health, while the prediction of
climatic events via advanced data analysis stands as a crucial
tool in disaster preparedness and mitigation.
Emphasis is placed on evaluating advanced methods like
dimensionality reduction, vector embeddings, and continual
learning to enhance the applicability and accuracy of this data in
real-world scenarios. Students will delve into various
applications, from monitoring deforestation and forecasting
educational indices to assessing water quality and predicting
climatic events, addressing key research questions about the
efficacy of satellite imagery in real-time monitoring and
climate-related prediction.
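One common preprocessing step for reducing cloud interference is per-pixel temporal median compositing: for each pixel, take the median over several acquisition dates, ignoring observations flagged as cloudy. The sketch below illustrates the idea on tiny hand-written grids. It does not call the SentinelHub or Google Earth Engine APIs, and the pixel values and the use of `None` as a cloud mask are assumptions made for illustration.

```python
import statistics


def median_composite(time_series):
    """Per-pixel median over time.

    time_series is a list of 2D grids (lists of rows) with identical
    shape; None marks a cloud-masked pixel. A pixel that is masked on
    every date stays None in the output.
    """
    rows = len(time_series[0])
    cols = len(time_series[0][0])
    composite = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            clear = [grid[r][c] for grid in time_series if grid[r][c] is not None]
            out_row.append(statistics.median(clear) if clear else None)
        composite.append(out_row)
    return composite


# Three hypothetical acquisitions of a 2 x 2 tile.
scenes = [
    [[0.30, 0.32], [None, 0.50]],
    [[0.31, None], [0.45, 0.52]],
    [[0.90, 0.33], [0.44, None]],  # 0.90 mimics a bright cloud the mask missed
]
composite = median_composite(scenes)
```

The median is used here because it is robust to a single bright outlier such as an undetected cloud, which a mean would not be.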
Language models have become integral to web interfaces, providing users with an interactive and intuitive experience. However, integrating these models into web environments raises concerns about security and privacy, as interactions are susceptible to tracking mechanisms such as cookies and type monitoring. This has important implications in healthcare, where protected health information may be present in other browser tabs or stored in clipboards. In addition, if information contained in browser histories, cookies, or past conversations alters language model outputs, this could affect patient diagnoses and treatments. This project aims to investigate the security of web interfaces that utilize language models, focusing on identifying potential vulnerabilities and proposing mitigation strategies.
This project has many angles, and students can tackle different components of this project and collaborate toward the ultimate goal of developing a comprehensive understanding of the security landscape of language models in web interfaces and proposing solutions that enhance privacy and data protection. By participating in this project, students will gain valuable insights into the security challenges associated with language models in web interfaces and contribute to developing more secure and privacy-conscious technologies.
While both imaging and NLP tasks are laden with bias, integrating different modalities into one system may create and compound new, less easily understood biases. Given the increasing potential of language models to be used in radiology to report clinical imaging studies, there is an urgent safety need to understand how biases transfer and transform between models and data modalities. This project aims to assess how different pretraining mixtures for imaging classifiers and language models contribute downstream to bias in model outputs in multimodal tasks. Students will leverage a new co-occurrence dataset mapping disease-demographic associations from 600 diseases in language model pretraining datasets (e.g., arXiv, Common Crawl, PubMed) and demographic information from MIMIC-CXR to create such mixtures. Students will explore the impact of the pretraining mixtures as well as evaluate the malleability of each modality to ‘debiasing’ through fine-tuning approaches.
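As a rough illustration of what a disease-demographic co-occurrence count could look like, the sketch below counts the documents in which a disease term and a demographic term appear within a fixed token window of each other. How the actual dataset was built is not described here, so the window size, the single-word terms, and the sample sentences are all hypothetical.

```python
import re
from collections import Counter
from itertools import product


def cooccurrence_counts(documents, disease_terms, demographic_terms, window=10):
    """Count documents where a disease and a demographic term co-occur.

    Terms are matched as whole lowercase tokens, so only single-word
    terms are supported in this sketch. Each document contributes at
    most 1 to each (disease, demographic) pair.
    """
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        positions = {}
        for idx, tok in enumerate(tokens):
            positions.setdefault(tok, []).append(idx)
        for disease, group in product(disease_terms, demographic_terms):
            d_pos = positions.get(disease, [])
            g_pos = positions.get(group, [])
            if any(abs(i - j) <= window for i in d_pos for j in g_pos):
                counts[(disease, group)] += 1
    return counts


# Made-up sentences standing in for a pretraining corpus.
docs = [
    "Asthma prevalence was higher among women in this cohort.",
    "Men with diabetes were older; asthma was rare.",
    "Diabetes screening in women and men.",
]
counts = cooccurrence_counts(docs, ["asthma", "diabetes"], ["women", "men"], window=5)
```

Raw counts like these would normally be normalized, for example against how often each term appears on its own, before being read as an association.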
As machine learning models, particularly in the healthcare
domain, become increasingly sophisticated, surpassing human-level
performance and comprehension, the necessity for effective
supervision mechanisms grows. OpenAI's recent concept of
weak-to-strong generalization, where smaller, less capable models
supervise larger, more capable ones, offers a promising avenue [1].
However, this approach raises critical questions about biases
inherent in weak supervision, particularly when using clinical
outcomes and notes as labels, and their impact on the
generalization capabilities of strong models. This project aims to
investigate how biases in weak supervision affect the
generalization performance of strong pre-trained models in the
medical field. It focuses on understanding whether these models
can generalize to solve complex problems with only weak,
incomplete, or flawed training labels. Students will
evaluate which weak-label errors affect generalization across subgroups
of topics and how the structure of such biases impacts performance
and alignment. By examining different types of weak label biases
and their influence on the generalization capabilities of strong
models, this project seeks to contribute to a deeper understanding
of how biased labels impact machine learning, particularly in
healthcare AI. It aims to shed light on the importance of errors
in weak supervision and provide insights into developing more
robust, unbiased AI systems in healthcare.
[1] Burns et al. (2023). Weak-to-strong generalization: eliciting
strong capabilities with weak supervision. https://cdn.openai.com/papers/weak-to-strong-generalization.pdf
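Burns et al. summarize weak-to-strong results with "performance gap recovered" (PGR): the fraction of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised strong model closes. The sketch below computes PGR separately per subgroup, which is one possible way to look at whether biased weak labels hurt some groups more than others. The subgroup names and accuracy values are hypothetical.

```python
def performance_gap_recovered(weak, weak_to_strong, strong_ceiling):
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    1.0 means the strong model trained on weak labels matches the
    ceiling; 0.0 means it does no better than the weak supervisor.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies for two patient subgroups.
subgroup_accuracy = {
    "group_a": {"weak": 0.70, "weak_to_strong": 0.82, "strong_ceiling": 0.90},
    "group_b": {"weak": 0.60, "weak_to_strong": 0.63, "strong_ceiling": 0.90},
}
pgr_by_group = {
    name: performance_gap_recovered(**acc)
    for name, acc in subgroup_accuracy.items()
}
```

In this made-up example, group_a recovers about 60% of the gap and group_b about 10%, the kind of disparity the project would want to detect and explain.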
The exponential growth of biomedical literature presents a
significant challenge in accurately tracking and validating the
assertions made within these publications. This project aims to
create a novel approach for systematically analyzing and
cataloging claims made in scholarly articles, focusing on clinical
evidence and drug-related research. Students will develop a
sophisticated methodology to scan academic papers, identify cited
claims, and assess the veracity and origin of these assertions.
A proof of concept has been developed as part of a recent paper to
invalidate the claim that AUPRC is better in settings of class
imbalance [1]. This project will involve building on this pipeline
and parsing through academic papers to extract citations and the
associated claims linked to these references. The core of the work
will be to verify claims in scientific literature related to
clinical diagnosis and treatment. An essential component of this
project is creating an unstructured database of 'claims',
meticulously linking them to papers that endorse, reference, or
assert these claims without proper attribution. This database will
be a critical tool for researchers to trace the origin and
evolution of specific clinical assertions and their underlying
evidence.
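The extraction step described above can be sketched as follows: split a paper's text into sentences, keep the sentences that carry a citation marker, and index each cleaned sentence under the references it cites. This is a minimal illustration and not the pipeline from the proof-of-concept paper. It assumes numeric bracketed citations such as `[3, 7]` and naive sentence splitting, and it only extracts claims without checking them.

```python
import re
from collections import defaultdict

# Matches numeric citation markers such as [2] or [3, 7].
CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def extract_cited_claims(text):
    """Return (claim sentence, [reference numbers]) for each cited sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    claims = []
    for sentence in sentences:
        refs = []
        for match in CITATION.finditer(sentence):
            refs.extend(int(n) for n in re.split(r"\s*,\s*", match.group(1)))
        if refs:
            claim = CITATION.sub("", sentence)
            claim = re.sub(r"\s+([.,;])", r"\1", claim)  # tidy space before punctuation
            claim = re.sub(r"\s{2,}", " ", claim).strip()
            claims.append((claim, refs))
    return claims


def build_claim_index(papers):
    """Map (paper id, reference number) to the claims attributed to it."""
    index = defaultdict(list)
    for paper_id, text in papers.items():
        for claim, refs in extract_cited_claims(text):
            for ref in refs:
                index[(paper_id, ref)].append(claim)
    return dict(index)


# Made-up paper text for illustration.
sample = (
    "AUPRC is preferred under class imbalance [3, 7]. "
    "We trained a classifier. "
    "Prior work reports similar findings [2]."
)
claims = extract_cited_claims(sample)
index = build_claim_index({"paper_x": sample})
```

Resolving each reference number to an actual paper, and then checking whether that paper supports the claim, are the harder steps that the project would need to address.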
A key area of interest for this project is the exploration of
clinical topics like COVID-19, as we would be able to evaluate the
full lifecycle of knowledge generation from the emergence of the
condition to the current day. This would then be compared with
gold standard systematic reviews and national guidelines. However,
students are encouraged to suggest and explore other clinical
domains where the 'claim graph' approach could yield insightful
results. The project also opens up the possibility of exploring
sub-claims and how they contribute to broader assertions, allowing
for a more nuanced understanding of the interplay of ideas within
the scientific community. This project will not only contribute to
the integrity and transparency of scientific reporting but will
also provide a valuable resource for researchers, clinicians, and
policymakers in making informed decisions based on the most
reliable and accurately attributed evidence.
[1] Matthew B. A. McDermott, Lasse Hyldig Hansen, Haoran Zhang,
Giovanni Angelotti, Jack Gallifant. A Closer Look at AUROC and
AUPRC under Class Imbalance. https://arxiv.org/pdf/2401.06091.pdf
These projects, from global startups selected and supported by MIT Solve, are not just research. The work the students do is likely to translate to real impact on real people, specifically people who are normally most underserved by health services.
The analysis completed by last year’s students with Solver team M’Care
was presented to a local ministry of health in rural Nigeria, and
led to expanded access to a nutrition program for underserved
children.
To connect with specific projects, please get in touch with
Noel Shaskan (Noel.Shaskan@solve.mit.edu).
The Solution: An India-based team using a combination of hardware and software solutions to fill care gaps for rural and poor communities, where high costs and lack of healthcare specialists mean diagnosis of conditions can be incredibly difficult. Much of their work to date has been focused on Chronic Respiratory Diseases (COPD, ILD, Post TB, Long COVID, Asthma) but they’re expanding their work to include cardiometabolic diseases and common cancers like lung, breast, and oral. They have supported 55,000 patients to date.
The data: Briota has primary data around clinical breast examination using a simple-to-use light-source device and OCR screening cards at the "screening" level. In these screening cards, potential breast cancer anomalies are listed and classified. Secondary data about mammography outcomes for such patients should already be available, and we will be preparing primary data based on our "screening camps" as well.
The Project: We have designed a breast health score algorithm. We would like to build prediction models in which this breast health score is used to predict both whether a mammogram is required and the mammography outcome.
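Before fitting a full prediction model, a simple first check is how well a single cutoff on the score separates patients by mammography outcome. The sketch below computes sensitivity and specificity at a chosen threshold. The scores, the outcomes, the 0.5 cutoff, and the convention that a higher score means higher risk are all hypothetical, since the breast health score algorithm itself is not described here.

```python
def confusion_at_threshold(scores, outcomes, threshold):
    """Count TP, FP, TN, FN when flagging every score >= threshold."""
    tp = fp = tn = fn = 0
    for score, positive in zip(scores, outcomes):
        flagged = score >= threshold
        if flagged and positive:
            tp += 1
        elif flagged and not positive:
            fp += 1
        elif not flagged and positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scores, outcomes, threshold):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_at_threshold(scores, outcomes, threshold)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical screening scores and mammography outcomes (True = abnormal).
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.20, 0.90, 0.55]
outcomes = [False, False, True, True, True, False, True, False]
sens, spec = sensitivity_specificity(scores, outcomes, threshold=0.5)
```

Sweeping the threshold over a range of values would show the trade-off between missed abnormal cases and unnecessary mammography referrals.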
The team:
The Solution: Khushi Baby is another India-based project. They’ve served over 50M people -- mainly poor, rural women and children -- by building clinics, creating data-driven solutions for Community Health Workers, and working closely with local government to track health and deliver care.
The data: Khushi Baby has health data anonymized and aggregated at the village census tract level for some 50+ indicators across reproductive and child health, noncommunicable disease, and infectious diseases across Rajasthan for ~45M beneficiaries.
The Project: Khushi Baby is looking for support in this new effort to link their health data to climate data, so they can ultimately better predict outbreaks and other health impacts of climate change.
They are looking to assess statistical spatial associations and spatiotemporal trends between climatic data and these health outcome indicators. Climate data sets at the 5-10 km^2 resolution are being obtained from Google’s Environmental signal data. Specific associations of interest include land surface temperature, rainfall, vegetation, and UV against malnutrition, anemia, and vector-borne disease; and PM2.5 against TB, chronic respiratory disease, and cardiovascular disease. More information can be found here.
We hope the class can help us both in implementing a pipeline for multi weighted geographic regression and in extracting relevant features from randomized satellite data on similar resolutions.
The Team: Sarfraz, Chief Data Scientist at Khushi Baby, has been with us for the last 7 years and is also a data science mentor with Great Learning at MIT. He is also mentoring master's students in a new climate and health track at BITS Pilani in collaboration with JPAL.
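The regression mentioned above is read here as belonging to the family of geographically weighted regression methods, which is an assumption about what the team intends. The core idea is to fit a separate weighted regression at each location, with nearby sites weighted more heavily than distant ones, so that the estimated association can vary over space. The sketch below fits one predictor with a Gaussian distance kernel and a single bandwidth. The coordinates are treated as planar kilometers, and the site data and 10 km bandwidth are hypothetical.

```python
import math


def gaussian_weight(distance_km, bandwidth_km):
    """Gaussian kernel: weight decays smoothly with distance."""
    return math.exp(-0.5 * (distance_km / bandwidth_km) ** 2)


def local_weighted_fit(focal, sites, bandwidth_km):
    """Weighted least squares for one predictor at a focal point.

    sites is a list of (x_km, y_km, predictor, outcome) tuples.
    Returns (intercept, slope) estimated around `focal`.
    """
    weights = [
        gaussian_weight(math.dist(focal, (sx, sy)), bandwidth_km)
        for sx, sy, _, _ in sites
    ]
    total = sum(weights)
    mean_p = sum(w * p for w, (_, _, p, _) in zip(weights, sites)) / total
    mean_o = sum(w * o for w, (_, _, _, o) in zip(weights, sites)) / total
    spp = sum(w * (p - mean_p) ** 2 for w, (_, _, p, _) in zip(weights, sites))
    spo = sum(
        w * (p - mean_p) * (o - mean_o) for w, (_, _, p, o) in zip(weights, sites)
    )
    if spp == 0:
        raise ValueError("predictor has no weighted variance near the focal point")
    slope = spo / spp
    intercept = mean_o - slope * mean_p
    return intercept, slope


# Hypothetical villages: the association is stronger in the west than the east.
sites = []
for i in range(5):
    sites.append((0.0, float(i), float(i), 1.0 + 2.0 * i))    # west: slope 2
    sites.append((100.0, float(i), float(i), 1.0 + 0.5 * i))  # east: slope 0.5

west_intercept, west_slope = local_weighted_fit((0.0, 2.0), sites, bandwidth_km=10.0)
east_intercept, east_slope = local_weighted_fit((100.0, 2.0), sites, bandwidth_km=10.0)
```

A global regression over all ten sites would report a single averaged slope; the local fits recover the two different associations, which is the motivation for spatially varying coefficients.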
The Solution: Rology is an Egypt-based team who work to fill health system gaps in multiple countries in the Middle East and North Africa that lack qualified radiologists and other healthcare providers. Their AI-assisted platform matches each individual case with the most qualified available (remote) radiologist.
The Data: Rology possesses an extensive inventory of over 300 million interpreted DICOM images across various modalities, including CT, MRI, and X-ray, among others. This dataset is de-identified and anonymized, ensuring patient privacy while providing a valuable resource for research and development in medical imaging. Rology's data inventory, characterized by its diversity, regulatory-grade quality, and continuous growth, offers a unique opportunity for advancing AI-driven solutions in healthcare.
The Project: Rology seeks to delve into the efficiency, productivity, accuracy, and workflow improvements achievable with AI in medical imaging. They seek to answer key questions on how multimodal generative AI, LLMs, and vision language models (VLMs) can enhance the detection of anomalies and streamline reporting processes. We aim to quantify the impact of these technologies on identifying subtle diagnostic features, improving early detection rates, and ultimately, optimizing the radiology workflow to support better patient outcomes.
The Team:
Dr. Qian's suggestions are to examine interesting publicly available datasets and to define projects supported by those data. If you are interested in working with him, please contact him via email at zhiyu.qian.jason@gmail.com.