Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Sort By
1 - 15 of 10822 publications
Preview abstract
For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution[1-3]. We combine existing mass-production results with modern approaches for loading classical data using ``quantum read-only memory.'' We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order or magnitude or more for a variety reasonably-sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step.
View details
mmMUSE: An mmWave-based Motion-resilient Universal Speech Enhancement System
Chenming He
Yanyong Zhang
Kai Wang
Dequan Wang
Lingyu Wang
the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), ACM (2026) (to appear)
Preview abstract
Voice-based smart systems can greatly enhance user experiences by allowing higher-quality interactions through better voice perception. Speech enhancement can benefit such systems by isolating noise from speech. Recently, integrating millimeter-wave (mmWave) with audio for speech perception has gained increasing attention due to microphones' limitations in noisy environments. However, mmWave-based vocal extraction is severely affected by motion, which disperses vocal signals across ranges and introduces distortions. In this paper, we propose an mmWave-based motion-resilient universal speech enhancement system called mmMUSE, which fuses mmWave and audio signals. To mitigate motion interference, we develop a Doppler-based method for motion-robust vocal signal extraction. Moreover, by introducing the Vocal-Noise-Ratio metric to assess the prominence of vocal signals from mmWave, we achieve real-time voice activity detection that gains 3.81 dB of SISDR in noisy speeches. Additionally, we design a two-stage complex-valued network that includes an attention-based fusion network for cross-modal complementing and a time-frequency masking network for correcting amplitude and phase of speech to isolate noises.
Using mmWave and audio datasets from 46 participants, mmMUSE outperforms the state-of-the-art speech enhancement models, achieving an average SISDR improvement of 3.12 dB. Additionally, mmMUSE achieves SISDR improvements of 16.51 dB, 17.93 dB, 14.93 dB, and 18.95 dB in controlled environments involving intense noise, extensive motion, multiple speakers, and various obstructive materials, respectively. Finally, we evaluate mmMUSE in real-world scenarios including running, public spaces, and driving, maintaining a word error rate (WER) below 10%.
View details
Preview abstract
AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
View details
Preview abstract
Browser fingerprinting is an online tracking technique that is being increasingly adopted for profiling and ad targeting purposes. While prior work has analyzed the prevalence and impact of browser fingerprinting on the Web, they have traditionally relied on large-scale automated crawls. Naturally, these cannot replicate real-human interactions, e.g., solve CAPTCHAs, evade bot detectors, or operate behind login pages and paywalls. This prompts the question as to whether or not the fingerprinting ecosystem is appreciably different in real-world browsing sessions. In this paper, we begin to address this question by designing and conducting a user study aimed at collecting actual telemetry data from real browsing sessions of 30 users.
We find that almost half of the fingerprinting websites identified from real user browsing sessions are missed by equivalent automated crawls. This is mainly due to the inability of automated crawls to identify and visit authentication pages, being blocked by bot detectors, and/or failing to perform user interactions that specifically trigger browser fingerprinting scripts. We also find new fingerprinting vectors that are consistently present in fingerprinting scripts captured by real user browsing sessions yet missing from automated crawls. Finally, we assess the feasibility of collecting fingerprinting training data in a privacy-preserving way. We conclude that private models built on real user browsing sessions can detect browser fingerprinting more effectively than models trained on automated crawls alone, while simultaneously providing strong privacy guarantees to users.
View details
Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs
Bowen Tan
Zheng Xu
Eric Xing
Zhiting Hu
International Conference on Machine Learning (ICML) (2025)
Preview abstract
Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on the manual prompts, and ineffectively use private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
View details
Fast ACS: Low-Latency File-Based Ordered Message Delivery at Scale
Anil Raghunath Iyer
Neel Bagora
Chang Yu
Olivier Pomerleau
Vivek Kumar
Prunthaban Kanthakumar
Usenix Annual Technical Conference (2025)
Preview abstract
Low-latency message delivery is crucial for real-time systems. Data originating from a producer must be delivered to consumers, potentially distributed in clusters across metropolitan and continental boundaries. With the growing scale of computing, there can be several thousand consumers of the data. Such systems require a robust messaging system capable of transmitting messages containing data across clusters and efficiently delivering them to consumers. The system must offer guarantees like ordering and at-least-once delivery while avoiding overload on consumers, allowing them to consume messages at their own pace.
This paper presents the design of Fast ACS (an abbreviation for Ads Copy Service), a file-based ordered message delivery system that leverages a combination of two-sided (inter-cluster) and one-sided (intra-cluster) communication primitives—namely, Remote Procedure Call and Remote Direct Memory Access, respectively—to deliver messages. The system has been successfully deployed to dozens of production clusters and scales to accommodate several thousand consumers within each cluster, which amounts to Tbps-scale intra-cluster consumer traffic at peak. Notably, Fast ACS delivers messages to consumers across the globe within a few seconds or even sub-seconds (p99) based on the message volume and consumer scale, at a low resource cost.
View details
Preview abstract
We study the existence of almost fair and near-optimal solutions to a routing problem as defined in the seminal work of Rosenthal. We focus on the setting where multiple alternative routes are available for each potential request (which corresponds to a potential user of the network). This model captures a collection of diverse applications such as packet routing in communication networks, routing in road networks with multiple alternative routes, and the economics of transportation of goods.
Our recommended routes have provable guarantees in terms of both the total cost and fairness concepts such as approximate envy-freeness. We employ and appropriately combine tools from algorithmic game theory and fair division. Our results apply on two distinct models: the splittable case where the request is split among the selected paths (e.g., routing a fleet of trucks) and the unsplittable case where the request is assigned to one of its designated paths (e.g., a single user request). Finally, we conduct an empirical analysis to test the performance of our approach against simpler baselines using the real world road network of New York City.
View details
Beyond Touchscreens: Dynamic and Multimodal Interaction Needs
Melissa Barnhart Wantland
Mai Kobori
Universal Access in Human-Computer Interaction, Springer-Verlag (2025)
Preview abstract
Today’s smartphone interactions are typically designed with one primary preset, accompanied by customization settings that can be manually adjusted. To promote the creation of contextually aware experiences, researchers have highlighted the factors that influence mobile device usage in the ability-based design framework. This paper expands upon existing frameworks and contributes to an empirical understanding of smartphone accessibility. Through a 10-day longitudinal diary study and video interview with 24 individuals who do and do not identify as having a disability, the research also illustrates the reactions of reattempt, adaptation, and avoidance, which were used in response to a lack of smartphone accessibility. Despite experiencing scenarios where accessibility settings could be leveraged, 20 out of 24 participants did not use accessibility settings on their smartphone. A total of 12 out of 24 participants tried accessibility settings on their smartphones, however identifying accessibility was not for them. This work highlights the need to shift current design practices to better serve the accessibility community.
View details
On Fisher Consistency of Surrogate Losses for Optimal Dynamic Treatment Regimes with Multiple Categorical Treatments per Stage
Nilanjana Laha
Nilson Chapagain
Victoria Cicherski
Aarón Sonabend
arxiv (2025) (to appear)
Preview abstract
Patients with chronic diseases often receive treatments at multiple time points, or stages. Our goal is to learn the optimal dynamic treatment regime (DTR) from longitudinal patient data. When both the number of stages and the number of treatment levels per stage are arbitrary, estimating the optimal DTR reduces to a sequential, weighted, multiclass classification problem (Kosorok and Laber, 2019). In this paper, we aim to solve this classification problem simultaneously across all stages using Fisher consistent surrogate losses. Although computationally feasible Fisher consistent surrogates exist in special cases, e.g., the binary treatment setting, a unified theory of Fisher consistency remains largely unexplored. We establish necessary and sufficient conditions for DTR Fisher consistency within the class of non-negative, stagewise separable surrogate losses. To our knowledge, this is the first result in the DTR literature to provide necessary conditions for Fisher consistency within a non-trivial surrogate class. Furthermore, we show that many convex surrogate losses fail to be Fisher consistent for the DTR classification problem, and we formally establish this inconsistency for smooth, permutation equivariant, and relative-margin-based convex losses. Building on this, we propose SDSS (Simultaneous Direct Search with Surrogates), which uses smooth, non-concave surrogate losses to learn the optimal DTR. We develop a computationally efficient, gradient-based algorithm for SDSS. When the optimization error is small, we establish a sharp upper bound on SDSS's regret decay rate. We evaluate the numerical performance of SDSS through simulations and demonstrate its real-world applicability by estimating optimal fluid resuscitation strategies for severe septic patients using electronic health record data.
View details
Escaping Collapse: The Strength of Weak Data for Large Language Model Training
Sara Babakniya
Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
Preview abstract
Synthetically-generated data plays an increasingly larger role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper curation it can cause LLM performance to plateau, or even "collapse", after many training iterations. In this paper, we formalize this question and develop a theoretical framework to investigate how much curation is needed in order to ensure that LLM performance continually improves. Our analysis is inspired by boosting, a classic machine
learning technique that leverages a very weak learning algorithm to produce an arbitrarily good classifier. The approach we analyze subsumes many recently proposed methods for training LLMs on synthetic data, and thus our analysis sheds light on why they are successful, and also suggests opportunities for future improvement. We present experiments that validate our theory, and show that dynamically focusing labeling resources on the most challenging examples --- in much the same way that boosting focuses the efforts of the weak learner --- leads to improved performance.
View details
Preview abstract
Hashing is a fundamental operation in various computer sci-
ence applications. Despite the prevalence of specific key
formats like social security numbers, MAC addresses, plate
numbers, and URLs, hashing libraries typically treat them as
general byte sequences. This paper introduces a technique
for synthesizing specialized hash functions tailored to par-
ticular byte formats. The proposed code generation method
leverages three prevalent patterns: (i) fixed-length keys, (ii)
keys with common subsequences, and (iii) keys ranging on
predetermined sequences of bytes. The code generation pro-
cess involves two algorithms: one identifies relevant regular
expressions within key examples, and the other generates
specialized hash functions based on these expressions. This
approach, straightforward to implement, showcases improve-
ments over highly optimized hash function implementations.
Comparative analysis demonstrates that our synthetic func-
tions outperform counterparts in the C++ Standard Template
Library and the Google Abseil Library, achieving speedups
ranging from 2% to 11%, depending on the key format.
View details
Score-based Causal Representation Learning: Linear and General Transformations
Burak Varici
Emre Acarturk
Abhishek Kumar
Ali Tajer
Journal of Machine Learning Research (JMLR) 2025 (2025)
Preview abstract
This paper addresses intervention-based causal representation learning (CRL) under a
general nonparametric latent causal model and an unknown transformation that maps the
latent variables to the observed variables. Linear and general transformations are investigated. The paper addresses both the identifiability and achievability aspects. Identifiability
refers to determining algorithm-agnostic conditions that ensure the recovery of the true
latent causal variables and the underlying latent causal graph. Achievability refers to the algorithmic aspects and addresses designing algorithms that achieve identifiability guarantees.
By drawing novel connections between score functions (i.e., the gradients of the logarithm of
density functions) and CRL, this paper designs a score-based class of algorithms that ensures
both identifiability and achievability. First, the paper focuses on linear transformations and
shows that one stochastic hard intervention per node suffices to guarantee identifiability. It
also provides partial identifiability guarantees for soft interventions, including identifiability
up to mixing with parents for general causal models and perfect recovery of the latent graph
for sufficiently nonlinear causal models. Secondly, it focuses on general transformations
and demonstrates that two stochastic hard interventions per node are sufficient for identifiability. This is achieved by defining a differentiable loss function whose global optima
ensure identifiability for general CRL. Notably, one does not need to know which pair of
interventional environments has the same node intervened. Finally, the theoretical results
are empirically validated via experiments on structured synthetic data and image data.
View details
Preview abstract
We initiate a study of algorithms for model training with user-level differential privacy (DP), where each example is associated with multiple users, which we call the multi-attribution model. We first provide a carefully chosen definition of user-level DP under the multi-attribution model. Next we study the contribution bounding problem, i.e. the problem of selecting a subset of the dataset for which each user is associated with a limited number of examples. We propose a greedy baseline algorithm for the contribution bounding algorithm. We then study this algorithm for a synthetic logistic regression task and a transformer training task, including studying a number of variants of this baseline algorithm that to optimize the subset chosen in various ways. We find that the baseline algorithm remains competitive with its variants in most settings, and build a better understanding of the practical importance of the bias-variance tradeoff inherent in the contribution bounding problem.
View details
Development and Evaluation of ML Models for Cardiotocography Interpretation
Nicole Chiou
Nichole Young-Lin
Abdoulaye Diack
Christopher Kelly
Sanmi Koyejo
NPJ Women's Health (2025)
Preview abstract
The inherent variability in the visual interpretation of cardiotocograms (CTGs) by obstetric clinical experts, both intra- and inter-observer, presents a substantial challenge in obstetric care. In response, we investigate automated CTG interpretation as a potential solution to enhance the early detection of fetal hypoxia during labor, thereby reducing unnecessary operative interventions and improving overall maternal and neonatal care. This study employs deep learning techniques to reduce the subjectivity associated with visual CTG interpretation. Our results demonstrate that employing objective cord blood pH measurements, rather than clinician-defined Apgar scores, yields more consistent and robust model performance. Additionally, through a series of ablation studies, we investigate the impact of temporal distribution shifts on the performance of these deep learning models. We examine tradeoffs between performance and fairness, specifically evaluating performance across demographic and clinical subgroups. Finally, we discuss the practical implications of our findings for the real-world deployment of such systems, emphasizing their potential utility in medical settings with limited resources.
View details
On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Yehonathan Refael
Amit Aides
Aviad Barzilai
Vered Silverman
Bolous Jaber
(2025)
Preview abstract
Open-vocabulary object detection (OVD) models offer remarkable flexibility applications by enabling object detection from arbitrary text queries. Still, the zero-shot performance of the pre-trained models is hampered by the inherent semantic ambiguity of natural language, result to low precision, leading to insufficient crucial downstream applications. For instance, in the remote sensing (RS) domain, a query for "ship" can yield varied and contextually irrelevant results. To address this, for real time applications, we propose a novel cascaded architecture that synergizes the broad capabilities of a large, pre-trained OVD model with a lightweight, few-shot classifier. Our approach utilizes the frozen weights of the zero-shot model to generate initial, high-recall object-embedding proposals, which are then refined by a compact classifier trained in real-time on a handful of user-annotated examples. The core of our contribution is an efficient one step active learning strategy for selecting the most informative samples for user annotation. Our method identifies (extremely) small amount of an uncertain candidates near the theoretical decision boundary using density estimation and then applies clustering to ensure a diverse training set. This targeted sampling enables our cascaded system to elevate performance on standard remote sensing benchmarks. Our work thus presents a practical and resource-efficient framework for adapting foundational models to specific user needs, drastically reducing annotation overhead while achieving high accuracy without costly full-model fine-tuning.
View details