Millicent Li

I'm a PhD student at Northeastern University, where I'm advised by Byron Wallace. I've been mostly thinking about evaluations in interpretability, namely whether current evaluations measure what we actually care about and how this translates into practical use cases of interpretability (e.g., auto-interpretability tools such as verbalization and probing). I also have broad interests in the science of training LMs, e.g., understanding their training dynamics when trained on synthetic data. I'm grateful to be supported by a Khoury PhD Fellowship and the NSF GRFP.

Before my PhD, I spent time at FAIR/Meta AI as an AI Resident, working with Marjan Ghazvininejad and Mike Lewis, and at Microsoft Research, working with Tristan Naumann. Before that, I was an undergrad at the University of Washington, working with Shwetak Patel on ubiquitous computing and Noah Smith on natural language processing.

Links: CV

News

April 2026

New preprint on understanding how capabilities of language models emerge during pre-training, What do Language Models Learn and When? The Implicit Curriculum Hypothesis, led by Emmy Liu!

October 2025

We investigate existing interpretability methods for decoding activations into natural language in our new preprint, Do Natural Language Descriptions of Model Activations Convey Privileged Information?

January 2025

Our paper, Multi-Field Adaptive Retrieval, done during my internship at Microsoft Semantic Machines, was accepted to ICLR 2025 as a spotlight (top 5%). Thanks to my amazing co-authors!

October 2024

New preprint, Multi-Field Adaptive Retrieval, from work done during my internship at Microsoft Semantic Machines!

August 2024

We've released a new preprint on causal interpretability, The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability - work done with David Bau's interpretability group.

February 2024

I've accepted an internship offer with Microsoft Semantic Machines for the upcoming summer, working with Patrick Xia and Tongfei Chen.

January 2024

Our paper, Function Vectors in Large Language Models, was accepted to ICLR 2024!

May 2023

Our paper, Summarizing, Simplifying, and Synthesizing Medical Evidence using GPT-3 (with Varying Success), was accepted to ACL 2023!

April 2022

I was awarded a 2022 NSF Graduate Research Fellowship. Northeastern wrote an article about it here.

August 2021

Started as an AI Resident with Fundamental AI Research (FAIR) at Meta in Seattle, working on natural language processing and human-computer interaction research for a year.

May 2021

Started my internship at Microsoft Research working with Tristan Naumann on the intersection of natural language processing and healthcare!

April 2021

Excited to announce that I’ll be starting my PhD in the Khoury College of Computer Sciences at Northeastern University in Boston, fall of 2022. Thanks to everyone who has supported me on this journey thus far!

March 2021

I was awarded an Honorable Mention for the 2021 NSF Graduate Research Fellowship competition.


Publications

2026

  1. What do Language Models Learn and When? The Implicit Curriculum Hypothesis
    Emmy Liu, Kaiser Sun, Millicent Li, Isabelle Lee, Lindia Tjuatja, Jen-tse Huang, Graham Neubig
    arXiv, 2026

2025

  1. Do Natural Language Descriptions of Model Activations Convey Privileged Information?
    Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C. Wallace
    arXiv, 2025
  2. The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
    Aaron Mueller, ... Millicent Li ... Yonatan Belinkov
    Computational Linguistics (CL), 2025
  3. Multi-Field Adaptive Retrieval
    Millicent Li, Tongfei Chen, Benjamin Van Durme, Patrick Xia
    International Conference on Learning Representations (ICLR), 2025
    Spotlight, Top 5%

2024

  1. Function Vectors in Large Language Models
    Eric Todd, Millicent Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau
    International Conference on Learning Representations (ICLR), 2024

2023

  1. Summarizing, Simplifying, and Synthesizing Medical Evidence using GPT-3 (with Varying Success)
    Chantal Shaib, Millicent L. Li, Sebastian Joseph, Iain Marshall, Junyi Jessy Li, Byron C. Wallace
    Annual Meeting of the Association for Computational Linguistics (ACL), 2023

2022

  1. A Review on Language Models as Knowledge Bases
    Badr AlKhamissi*, Millicent Li*, Asli Celikyilmaz^, Mona Diab^, Marjan Ghazvininejad^
    arXiv, 2022
    * denotes equal contribution
    ^ denotes equal supervision

2020

  1. Multi-Channel Facial Photoplethysmography Sensing
    Parker S. Ruth, Jerry Cao, Millicent Li, Jacob E. Sunshine, Edward J. Wang, Shwetak N. Patel
    International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2020