| CARVIEW |
Han Zhao | 赵晗
Assistant Professor
Department of Computer Science
Department of Electrical and Computer Engineering (affiliated)
University of Illinois Urbana-Champaign
hanzhao [AT] illinois (DOT) edu
3320 Siebel Center, 201 N Goodwin Ave Urbana, IL, 61801
About Me
I am an assistant professor at the Department of Computer Science, University of Illinois Urbana-Champaign, affiliated with the Department of Electrical and Computer Engineering. I am also an Amazon scholar at Amazon AI and Search Science.
Before joining UIUC, I was a machine learning researcher at D. E. Shaw & Co. I obtained my Ph.D. from the Machine Learning Department, Carnegie Mellon University. Previously, I obtained my BEng degree from the Computer Science Department at Tsinghua University and MMath from the University of Waterloo.
I have a broad interest in trustworthy machine learning. In particular, I work on transfer learning (domain adaptation/generalization/distributional robustness, multitask/meta-learning), algorithmic fairness, probabilistic circuits, and their applications in natural language, vision and quantitative finance. My long-term goal is to build trustworthy ML systems that are efficient, robust, fair, and interpretable.
Acknowledgments
Our group's research has been generously supported by Google Research, Meta AI, Amazon AI, Nvidia, IBM Research, the National Science Foundation (NSF), and the Defense Advanced Research Projects Agency (DARPA). Thank you!
Prospective students
For PhD applicants: Thank you for your interest! I am not recruiting any new PhD students in Fall' 26. If you are interested in our CS PhD program, please apply to the UIUC CS graduate program. There is no need to directly contact me regarding PhD admissions as it will be handled by the admission committee.
For undergraduate/MS students at UIUC: Please fill out this Google form. Your chance of getting involved is higher if more of the followings are true: you have a high GPA; you did quite well on courses related to math, statistics, and/or machine learning; you are able to commit 12+ hours per week on research; you have strong programming skills.
Publications
[
show
selected
/
show by date
]
A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement
Learning
Y. Hu, F. Wu, H. Ye, D. Forsyth, J. Zou, N. Jiang, J. W. Ma, H. Zhao
In Proceedings of the 39th Advances in Neural Information Processing Systems (NeurIPS
2025, Oral)
[abs]
[pdf]
Online reinforcement learning (RL) excels in complex, safety-critical domains, yet it faces
challenges such as sample inefficiency, training instability, and a lack of
interpretability. Data attribution offers a principled way to trace model behavior back to
individual training samples. However, in online RL, each training sample not only drives
policy updates but also influences future data collection, violating the fixed dataset
assumption in existing attribution methods. In this paper, we initiate the study of data
attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO)
algorithm. We start by establishing a local attribution framework, interpreting model
checkpoints with respect to the records in the recent training buffer. We design two target
functions, capturing agent action and cumulative return respectively, and measure each
record's contribution through gradient similarity between its training loss and these
targets. We demonstrate the power of this framework through three concrete applications:
diagnosis of learning, temporal analysis of behavior formation, and targeted intervention
during training. Leveraging this framework, we further propose an algorithm, iterative
influence-based filtering (IIF), for online RL training that iteratively performs experience
filtering to refine policy updates. Across standard RL benchmarks (classic control,
navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity,
speeds up training, and achieves higher returns. Overall, these results advance
interpretability, efficiency, and effectiveness of online RL.
Understanding and Improving Adversarial Robustness of Neural Probabilistic Circuits
W. Chen, H. Zhao
In Proceedings of the 39th Advances in Neural Information Processing Systems (NeurIPS
2025)
[abs]
[pdf] [code]
Neural Probabilistic Circuits (NPCs), a new class of concept bottleneck models, comprise an
attribute recognition model and a probabilistic circuit for reasoning. By integrating the
outputs from these two modules, NPCs produce compositional and interpretable predictions.
While offering enhanced interpretability and high performance on downstream tasks, the
neural-network-based attribute recognition model remains a black box. This vulnerability
allows adversarial attacks to manipulate attribute predictions by introducing carefully
crafted subtle perturbations to input images, potentially compromising the final
predictions. In this paper, we theoretically analyze the adversarial robustness of NPC and
demonstrate that it only depends on the robustness of the attribute recognition model and is
independent of the robustness of the probabilistic circuit. Moreover, we propose RNPC, the
first robust neural probabilistic circuit against adversarial attacks on the recognition
module. RNPC introduces a novel class-wise integration for inference, ensuring a robust
combination of outputs from the two modules. Our theoretical analysis demonstrates that RNPC
exhibits provably improved adversarial robustness compared to NPC. Empirical results on
image classification tasks show that RNPC achieves superior adversarial robustness compared
to existing concept bottleneck models while maintaining high accuracy on benign inputs.
Neural Probabilistic Circuits: Enabling Compositional and Interpretable Predictions through
Logical Reasoning
W. Chen, S. Yu, H. Shao, L. Sha, H. Zhao
arXiv preprint
[abs]
[pdf] [code]
End-to-end deep neural networks have achieved remarkable success across various domains but
are often criticized for their lack of interpretability. While post hoc explanation methods
attempt to address this issue, they often fail to accurately represent these black-box
models, resulting in misleading or incomplete explanations. To overcome these challenges, we
propose an inherently transparent model architecture called Neural Probabilistic Circuits
(NPCs), which enable compositional and interpretable predictions through logical reasoning.
In particular, an NPC consists of two modules: an attribute recognition model, which
predicts probabilities for various attributes, and a task predictor built on a probabilistic
circuit, which enables logical reasoning over recognized attributes to make class
predictions. To train NPCs, we introduce a three-stage training algorithm comprising
attribute recognition, circuit construction, and joint optimization. Moreover, we
theoretically demonstrate that an NPC's error is upper-bounded by a linear combination of
the errors from its modules. To further demonstrate the interpretability of NPC, we provide
both the most probable explanations and the counterfactual explanations. Empirical results
on four benchmark datasets show that NPCs strike a balance between interpretability and
performance, achieving results competitive even with those of end-to-end black-box models
while providing enhanced interpretability.
Efficient Utility-Preserving Machine Unlearning with Implicit Gradient Surgery
S. Zhou, T. Yu, Z. Zhang, H. Chang, X. Zhou, D. Wu, H. Zhao
In Proceedings of the 39th Advances in Neural Information Processing Systems (NeurIPS
2025)
[abs]
[pdf]
Machine unlearning (MU) aims to efficiently remove sensitive or harmful memory from a
pre-trained model. The key challenge is to balance the potential tradeoff between unlearning
efficacy and utility preservation, which involves forgetting undesirable information as
defined while maintaining the model's original performance. One potential way to tackle this
problem is to use multi-objective optimization to jointly optimize both the unlearning and
utility preservation objectives. However, existing multi-objective methods only guarantee
finding a Pareto-optimal solution without fine-grained control, which causes
under-optimization of the unlearning objective. To this end, we first model MU as a
constrained optimization problem, that is, optimizing the unlearning objective under the
constraint of a bounded increase for utility loss. We then show that solving this
optimization problem is equivalent to unilateral gradient surgery on the unlearning
objective. To resolve the additional computational cost brought by gradient surgery, we
propose an implicit gradient surgery method, which approximates the solution to the
aforementioned constrained optimization problem via only one backpropagation, thereby
achieving efficient utility-preserving MU. Theoretically, we provide a tight convergence
analysis of the algorithm. Empirically, our extensive experiments show that the proposed
algorithm achieves better tradeoff results than existing baselines. Theoretically, we
provide a tight convergence analysis of the algorithm. Empirically, our extensive
experiments show that the proposed algorithm achieves better tradeoff results than existing
baselines.
Taming Hyperparameter Sensitivity in Data Attribution: Practical Selection Without Costly
Retraining
W. Wang, J. Deng, Y. Hu, S. Zhang, X. Jiang, R. Zhang, H. Zhao, J. W. Ma
In Proceedings of the 39th Advances in Neural Information Processing Systems (NeurIPS
2025)
[abs]
[pdf]
Data attribution methods, which quantify the influence of individual training data points on
a machine learning model, have gained increasing popularity in data-centric applications in
modern AI. Despite a recent surge of new methods developed in this space, the impact of
hyperparameter tuning in these methods remains under-explored. In this work, we present the
first large-scale empirical study to understand the hyperparameter sensitivity of common
data attribution methods. Our results show that most methods are indeed sensitive to certain
key hyperparameters. However, unlike typical machine learning algorithms -- whose
hyperparameters can be tuned using computationally-cheap validation metrics -- evaluating
data attribution performance often requires retraining models on subsets of training data,
making such metrics prohibitively costly for hyperparameter tuning. This poses a critical
open challenge for the practical application of data attribution methods. To address this
challenge, we advocate for better theoretical understandings of hyperparameter behavior to
inform efficient tuning strategies. As a case study, we provide a theoretical analysis of
the regularization term that is critical in many variants of influence function methods.
Building on this analysis, we propose a lightweight procedure for selecting the
regularization value without model retraining, and validate its effectiveness across a range
of standard data attribution benchmarks. Overall, our study identifies a fundamental yet
overlooked challenge in the practical application of data attribution, and highlights the
importance of careful discussion on hyperparameter selection in future method development.
GraSS: Scalable Influence Function with Sparse Gradient Compression
P. Hu, J. Melkonian, W. Tang, H. Zhao, J. W. Ma
In Proceedings of the 39th Advances in Neural Information Processing Systems (NeurIPS
2025)
[abs]
[pdf] [code]
Gradient-based data attribution methods, such as influence functions, are critical for
understanding the impact of individual training samples without requiring repeated model
retraining. However, their scalability is often limited by the high computational and memory
costs associated with per-sample gradient computation. In this work, we propose GraSS, a
novel gradient compression algorithm and its variants FactGraSS for linear layers
specifically, that explicitly leverage the inherent sparsity of per-sample gradients to
achieve sub-linear space and time complexity. Extensive experiments demonstrate the
effectiveness of our approach, achieving substantial speedups while preserving data
influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on
billion-scale models compared to the previous state-of-the-art baselines. Our code is
publicly available at https://github.com/TRAIS-Lab/GraSS.
MergeBench: A Benchmark for Merging Domain-Specialized LLMs
Y. He, S. Zeng, Y. Hu, R. Yang, T. Zhang, and H. Zhao
In Proceedings of the 39th Advances in Neural Information Processing Systems, Track on Datasets
and Benchmarks (NeurIPS 2025, D&B Track)
[abs]
[pdf] [project page] [code] [Hugging Face models]
Model merging provides a scalable alternative to multi-task training by combining
specialized finetuned models through parameter arithmetic, enabling efficient deployment
without the need for joint training or access to all task data. While recent methods have
shown promise, existing evaluations are limited in both model scale and task diversity,
leaving open questions about their applicability to large, domain-specialized LLMs. To
tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to
assess model merging at scale. MergeBench builds on state-of-the-art open-source language
models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains:
instruction following, mathematics, multilingual understanding, coding and safety. We
standardize finetuning and evaluation protocols, and assess eight representative merging
methods across multi-task performance, forgetting and runtime efficiency. Based on extensive
experiments, we provide practical guidelines for algorithm selection and share insights
showing that model merging tends to perform better on stronger base models, with techniques
such as merging coefficient tuning and sparsification improving knowledge retention.
However, several challenges remain, including the computational cost on large models, the
gap for in-domain performance compared to multi-task models, and the underexplored role of
model merging in standard LLM training pipelines. We hope MergeBench provides a foundation
for future research to advance the understanding and practical application of model merging.
Our project page is at https://github.com/uiuctml/MergeBench.
MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
J. Shen, J. Yao, R. Yang, Y. Sun, F. Luo, R. Pan, T. Zhang, H. Zhao
In Proceedings of the Association for Computational Linguistics: EMNLP (EMNLP 2025,
Outstanding Paper Award)
[abs]
[pdf] [code] [poster]
Reward modeling is a key step in building safe foundation models when applying reinforcement
learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward
modeling based on the Bradley-Terry (BT) model assumes a global reward function, failing to
capture the inherently diverse and heterogeneous human preferences. Hence, such
oversimplification limits LLMs from supporting personalization and pluralistic alignment.
Theoretically, we show that when human preferences follow a mixture distribution of diverse
subgroups, a single BT model has an irreducible error. While existing solutions, such as
multi-objective learning with fine-grained annotations, help address this issue, they are
costly and constrained by predefined attributes, failing to fully capture the richness of
human values. In this work, we introduce MiCRo, a two-stage framework that enhances
personalized preference learning by leveraging large-scale binary preference datasets
without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces
context-aware mixture modeling approach to capture diverse human preferences. In the second
stage, MiCRo integrates an online routing strategy that dynamically adapts mixture weights
based on specific context to resolve ambiguity, allowing for efficient and scalable
preference adaptation with minimal additional supervision. Experiments on multiple
preference datasets demonstrate that MiCRo effectively captures diverse human preferences
and significantly improves downstream personalization.
Moment Alignment: Unifying Gradient and Hessian Matching for Domain Generalization
Y. Chen, H. Si, G. Zhang, H. Zhao
In Proceedings of the 41st conference on Uncertainty in Artificial Intelligence (UAI
2025)
[abs]
[pdf]
Domain generalization (DG) seeks to develop models that generalize well to unseen target
domains, addressing the prevalent issue of distribution shifts in real-world applications.
One line of research in DG focuses on aligning domain-level gradients and Hessians to
enhance generalization. However, existing methods are computationally inefficient and the
underlying principles of these approaches are not well understood. In this paper, we develop
the theory of moment alignment for DG. Grounded in \textit{transfer measure}, a principled
framework for quantifying generalizability between two domains, we first extend the
definition of transfer measure to domain generalization that includes multiple source
domains and establish a target error bound. Then, we prove that aligning derivatives across
domains improves transfer measure both when the feature extractor induces an invariant
optimal predictor across domains and when it does not. Notably, moment alignment provides a
unifying understanding of Invariant Risk Minimization, gradient matching, and Hessian
matching, three previously disconnected approaches to DG. We further connect feature moments
and derivatives of the classifier head, and establish the duality between feature learning
and classifier fitting. Building upon our theory, we introduce \textbf{C}losed-Form
\textbf{M}oment \textbf{A}lignment (CMA), a novel DG algorithm that aligns domain-level
gradients and Hessians in closed-form. Our method overcomes the computational inefficiencies
of existing gradient and Hessian-based techniques by eliminating the need for repeated
backpropagation or sampling-based Hessian estimation. We validate the efficacy of our
approach through two sets of experiments: linear probing and full fine-tuning. CMA
demonstrates superior performance in both settings compared to Empirical Risk Minimization
and state-of-the-art algorithms.
Multiobjective Distribution Matching
X. Zhang, P. Li, Y. Yu, Y. Zhang, H. Zhao, Q. Zhang
In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
[abs]
[pdf]
Distribution matching is a core concept in machine learning, with applications in generative
models, domain adaptation, and algorithmic fairness. A closely related but less explored
challenge is generating a distribution that aligns with multiple underlying distributions,
often with conflicting objectives, known as a Pareto optimal distribution.
In this paper, we develop a general theory based on information geometry to construct the
Pareto set and front for the entire exponential family under KL and inverse KL divergences.
This formulation allows explicit derivation of the Pareto set and front for multivariate
normal distributions, enabling applications like multiobjective variational autoencoders
(MOVAEs) to generate interpolated image distributions.
Experimental results on real-world images demonstrate that both algorithms can generate
high-quality interpolated images across multiple distributions.
Scaling Laws for Multilingual Language Models
Y. He, A. Benhaim, B. Patra, P. Vaddamanu, S. Ahuja, P. Chopra, V. Chaudhary, H. Zhao,
X. Song
In Proceedings of the 63th Annual Meeting of the Association for Computational Linguistics
(ACL 2025 Findings)
[abs]
[pdf]
We propose a novel scaling law for general-purpose decoder-only language models (LMs)
trained on multilingual data, tackling the problem of balancing languages during
multilingual pretraining. A primary challenge in studying multilingual scaling is the
difficulty of analyzing individual language performance due to cross-lingual transfer. To
address this, we shift the focus from individual languages to language families. We
introduce and validate a hypothesis that the test cross-entropy loss for each language
family is determined solely by its own sampling ratio, independent of other languages in the
mixture. This insight simplifies the complexity of multilingual scaling and make the
analysis scalable to an arbitrary number of languages. Building on this hypothesis, we
derive a power-law relationship that links performance with dataset size, model size and
sampling ratios. This relationship enables us to predict performance across various
combinations of the above three quantities, and derive the optimal sampling ratios at
different model scales. To demonstrate the effectiveness and accuracy of our proposed
scaling law, we perform a large-scale empirical study, training more than 100 models on 23
languages spanning 5 language families. Our experiments show that the optimal sampling
ratios derived from small models (85M parameters) generalize effectively to models that are
several orders of magnitude larger (1.2B parameters), offering a resource-efficient approach
for multilingual LM training at scale.
An Improved Autoregressive Evaluation Paradigm for Large Language Models
J. Zhang, R. Pan, Y. Hu, K. Shum, G. Yao, X. Liu, R. Pi, H. Dong, S. Diao, Y. Lin, H.
Zhao, T. Zhang
In ACM Transactions on Intelligent Systems and Technology (TIST 2025).
[abs]
[pdf]
The AI community has witnessed the emergence of various chat-style Large Language Models
(LLMs) since
the advent of ChatGPT. Despite significant progress in this area, evaluating these models
remains a substantial
challenge. The evaluations provided by humans or GPT-4 oracles are often taken as the gold
standard, but
they are neither automatic nor scalable. More recently, a series of (open-source) LLM-based
judge models have
been introduced, yet they often exhibit model-specific biases, e.g., a LLaMA-family judge
favors a LLaMAfamily model. On the other hand, autoregressive evaluation metrics, which
holds the potential to address the
aforementioned issues, remains underexplored. Among them, likelihood-based metrics such as
perplexity
and negative log-likelihood (NLL) are widely adopted and has proven effective in tracking
the pretraining
progress of LLMs. However, they struggle to evaluate the generation capabilities of
fine-tuned models due
to exposure bias, a phenomenon where the distribution of the model's output gradually
deviates from the ground-truth during inference. To address this key issue, in this paper,
we propose a novel autoregressive
metric, Normalized Discounted Cumulative Gain (NDCG), to improve the evaluation of
fine-tuned LLMs. Our
experimental results demonstrate that NDCG significantly outperforms likelihood-based
metrics: it shows
over 45% improvement in both Spearman and Kendall's tau correlation coefficients for
commonsense QA
tasks, and aligns more closely with GPT-4 Elo rankings for instruction-tuned models.
Learning Structured Representations by Embedding Class Hierarchy with Fast Optimal
Transport
S. Zeng, S. Du, M. Yamada, H. Zhao
In Proceedings of the 13th International Conference on Learning Representations (ICLR
2025)
[abs]
[pdf] [code]
To embed structured knowledge within labels into feature representations, prior work (Zeng
et al., 2022) proposed to use the Cophenetic Correlation Coefficient (CPCC) as a regularizer
during supervised learning. This regularizer calculates pairwise Euclidean distances of
class means and aligns them with the corresponding shortest path distances derived from the
label hierarchy tree. However, class means may not be good representatives of the class
conditional distributions, especially when they are multi-mode in nature. To address this
limitation, under the CPCC framework, we propose to use the Earth Mover's Distance (EMD) to
measure the pairwise distances among classes in the feature space. We show that our exact
EMD method generalizes previous work, and recovers the existing algorithm when
class-conditional distributions are Gaussian in the feature space. To further improve the
computational efficiency of our method, we introduce the Optimal Transport-CPCC family by
exploring four EMD approximation variants. Our most efficient OT-CPCC variant runs in linear
time in the size of the dataset, while maintaining competitive performance across datasets
and tasks. The code is available at https://github.com/uiuctml/OTCPCC.
Accelerating Neural ODEs: A Variational Formulation-based Approach
H. Zhao, Y. Wang, H. Qi, Z. Huang, H. Zhao, L. Sha, H. Shao
In Proceedings of the 13th International Conference on Learning Representations (ICLR
2025)
[abs]
[pdf] [code]
Neural Ordinary Differential Equations (Neural ODEs or NODEs) excel at modeling continuous
dynamical systems from observational data, especially when the data is irregularly sampled.
However, existing training methods predominantly rely on numerical ODE solvers, which are
time-consuming and prone to accumulating numerical errors over time due to autoregression.
In this work, we propose VF-NODE, a novel approach based on the variational formulation (VF)
to accelerate the training of NODEs. Unlike existing training methods, the proposed VF-NODEs
implement a series of global integrals, thus evaluating Deep Neural Network (DNN)--based
vector fields only at specific observed data points. This strategy drastically reduces the
number of function evaluations (NFEs). Moreover, our method eliminates the use of
autoregression, thereby reducing error accumulations for modeling dynamical systems.
Nevertheless, the VF loss introduces oscillatory terms into the integrals when using the
Fourier basis. We incorporate Filon's method to address this issue. To further enhance the
performance for noisy and incomplete data, we employ the natural cubic spline regression to
estimate a closed-form approximation. We provide a fundamental analysis of how our approach
minimizes computational costs. Extensive experiments demonstrate that our approach
accelerates NODE training by 10 to 1000 times compared to existing NODE-based methods, while
achieving higher or comparable accuracy in dynamical systems. The code is available at
https://github.com/ZhaoHongjue/VF-NODE-ICLR2025.
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
S. Poppi, ZX. Yong, Y. He, B. Chern, H. Zhao, A. Yang, J. Chi
In Findings of the 2025 Annual Conference of the Nations of the Americas Chapter of the
Association for Computational Linguistics (NAACL Findings 2025)
[abs]
[pdf]
Recent advancements in Large Language Models (LLMs) have sparked widespread concerns about
their safety. Recent work demonstrates that safety alignment of LLMs can be easily removed
by fine-tuning with a few adversarially chosen instruction-following examples, i.e.,
fine-tuning attacks. We take a further step to understand fine-tuning attacks in
multilingual LLMs. We first discover cross-lingual generalization of fine-tuning attacks:
using a few adversarially chosen instruction-following examples in one language,
multilingual LLMs can also be easily compromised (e.g., multilingual LLMs fail to refuse
harmful prompts in other languages). Motivated by this finding, we hypothesize that
safety-related information is language-agnostic and propose a new method termed Safety
Information Localization (SIL) to identify the safety-related information in the model
parameter space. Through SIL, we validate this hypothesis and find that only changing 20% of
weight parameters in fine-tuning attacks can break safety alignment across all languages.
Furthermore, we provide evidence to the alternative pathways hypothesis for why freezing
safety-related parameters does not prevent fine-tuning attacks, and we demonstrate that our
attack vector can still jailbreak LLMs adapted to new languages.
Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic
Y. He, Y. Hu, Y. Lin, T. Zhang, H. Zhao
Transactions on Machine Learning Research (TMLR 2025, J2C certificate)
[abs]
[pdf] [code]
Model merging offers an effective strategy to combine the strengths of multiple finetuned
models into a unified model that preserves the specialized capabilities of each. Existing
methods merge models in a global manner, performing arithmetic operations across all model
parameters. However, such global merging often leads to task interference, degrading the
performance of the merged model. In this work, we introduce Localize-and-Stitch, a novel
approach that merges models in a localized way. Our algorithm works in two steps: i)
Localization: identify tiny (1% of the total parameters) localized regions in the finetuned
models containing essential skills for the downstream tasks, and ii) Stitching: reintegrate
only these essential regions back into the pretrained model for task synergy. We demonstrate
that our approach effectively locates sparse regions responsible for finetuned performance,
and the localized regions could be treated as compact and interpretable representations of
the finetuned models (tasks). Empirically, we evaluate our method on various vision and
language benchmarks, showing that it outperforms existing model merging methods under
different data availability scenarios. Beyond strong empirical performance, our algorithm
also facilitates model compression and preserves pretrained knowledge, enabling flexible and
continual skill composition from multiple finetuned models with minimal storage and
computational overhead.
Most Influential Subset Selection: Challenges, Promises, and Beyond
Y. Hu, P. Hu, H. Zhao, J. Ma
In Proceedings of the 38th Advances in Neural Information Processing Systems (NeurIPS
2024)
[abs]
[pdf] [poster]
How can we attribute the behaviors of machine learning models to their training data? While
the classic influence function sheds light on the impact of individual samples, it often
fails to capture the more complex and pronounced collective influence of a set of samples.
To tackle this challenge, we study the Most Influential Subset Selection (MISS) problem,
which aims to identify a subset of training samples with the greatest collective influence.
We conduct a comprehensive analysis of the prevailing approaches in MISS, elucidating their
strengths and weaknesses. Our findings reveal that influence-based greedy heuristics, a
dominant class of algorithms in MISS, can provably fail even in linear regression. We
delineate the failure modes, including the errors of influence function and the non-additive
structure of the collective influence. Conversely, we demonstrate that an adaptive version
of these heuristics which applies them iteratively, can effectively capture the interactions
among samples and thus partially address the issues. Experiments on real-world datasets
corroborate these theoretical findings, and further demonstrate that the merit of adaptivity
can extend to more complex scenarios such as classification tasks and non-linear neural
networks. We conclude our analysis by emphasizing the inherent trade-off between performance
and computational efficiency, questioning the use of additive metrics such as the linear
datamodeling score, and offering a range of discussions.
Learning Structured Representations with Hyperbolic Embeddings
A. Sinha, S. Zeng, M. Yamada, H. Zhao
In Proceedings of the 38th Advances in Neural Information Processing Systems (NeurIPS
2024)
[abs]
[pdf] [code] [poster]
Most real-world datasets consist of a natural hierarchy between classes or an inherent label
structure that is either already available or can be constructed cheaply. However, most
existing representation learning methods ignore this hierarchy, treating labels as
permutation invariant. Recent work [Zeng et al., 2022] proposes using this structured
information explicitly, but the use of Euclidean distance may distort the underlying
semantic context [Chen et al., 2013]. In this work, motivated by the advantage of hyperbolic
spaces in modeling hierarchical relationships, we propose a novel approach HypStructure: a
Hyperbolic Structured regularization approach to accurately embed the label hierarchy into
the learned representations. HypStructure is a simple-yet-effective regularizer that
consists of a hyperbolic tree-based representation loss along with a centering loss, and can
be combined with any standard task loss to learn hierarchy-informed features. Extensive
experiments on several large-scale vision benchmarks demonstrate the efficacy of
HypStructure in reducing distortion and boosting generalization performance especially under
low dimensional scenarios. For a better understanding of structured representation, we
perform eigenvalue analysis that links the representation geometry to improved
Out-of-Distribution (OOD) detection performance seen empirically.
On the Expressive Power of Tree-Structured Probabilistic Circuits
L. Yin, H. Zhao
In Proceedings of the 38th Advances in Neural Information Processing Systems (NeurIPS
2024)
[abs]
[pdf] [poster]
Probabilistic circuits (PCs) have emerged as a powerful framework to compactly represent
probability distributions for efficient and exact probabilistic inference. It has been shown
that PCs with a general directed acyclic graph (DAG) structure can be understood as a
mixture of exponentially (in its height) many components, each of which is a product
distribution over univariate marginals. However, existing structure learning algorithms for
PCs often generate tree-structured circuits or use tree-structured circuits as intermediate
steps to compress them into DAG-structured circuits. This leads to the intriguing question
of whether there exists an exponential gap between DAGs and trees for the PC structure. In
this paper, we provide a negative answer to this conjecture by proving that, for n
variables, there exists a quasi-polynomial upper bound $n^O(\log n)$ on the size of an
equivalent
tree computing the same probability distribution. On the other hand, we also show that given
a depth restriction on the tree, there is a super-polynomial separation between tree and
DAG-structured PCs. Our work takes an important step towards understanding the expressive
power of tree-structured PCs, and our techniques may be of independent interest in the study
of structure learning algorithms for PCs.
FedGTST: Boosting Global Transferability of Federated Models via Statistics Tuning
E. Ma, C. Pan, S. Rasoul Etesami, H. Zhao, O. Milenkovic
In Proceedings of the 38th Advances in Neural Information Processing Systems (NeurIPS
2024)
[abs]
[pdf]
The performance of Transfer Learning (TL) heavily relies on effective pretraining, which
demands large datasets and substantial computational resources. As a result, executing TL is
often challenging for individual model developers. Federated Learning (FL) addresses these
issues by facilitating collaborations among clients, expanding the dataset indirectly,
distributing computational costs, and preserving privacy. However, key challenges remain
unresolved. First, existing FL methods tend to optimize transferability only within local
domains, neglecting the global learning domain. Second, most approaches rely on indirect
transferability metrics, which do not accurately reflect the final target loss or true
degree of transferability. To address these gaps, we propose two enhancements to FL. First,
we introduce a client-server exchange protocol that leverages cross-client Jacobian
(gradient) norms to boost transferability. Second, we increase the average Jacobian norm
across clients at the server, using this as a local regularizer to reduce cross-client
Jacobian variance. Our transferable federated algorithm, termed FedGTST (Federated Global
Transferability via Statistics Tuning), demonstrates that increasing the average Jacobian
and reducing its variance allows for tighter control of the target loss. This leads to an
upper bound on the target loss in terms of the source loss and source-target domain
discrepancy. Extensive experiments on datasets such as MNIST to MNIST-M and CIFAR10 to SVHN
show that FedGTST outperforms relevant baselines, including FedSR. On the second dataset
pair, FedGTST improves accuracy by 9.8% over FedSR and 7.6% over FedIIR when LeNet is used
as the backbone.
LibMOON: A Gradient-based MultiObjective OptimizatioN Library in PyTorch
X. Zhang, L. Zhao, Y. Yu, X. Lin, Y. Chen, H. Zhao, Q. Zhang
In Proceedings of the 38th Advances in Neural Information Processing Systems, Track on Datasets
and Benchmarks (NeurIPS
2024, D&B Track)
[abs]
[pdf] [code]
Multiobjective optimization problems (MOPs) are prevalent in machine learning, with
applications in multi-task learning, learning under fairness or robustness constraints, etc.
Instead of reducing multiple objective functions into a scalar objective, MOPs aim to
optimize for the so-called Pareto optimality or Pareto set learning, which involves
optimizing more than one objective function simultaneously, over models with thousands /
millions of parameters. Existing benchmark libraries for MOPs mainly focus on evolutionary
algorithms, most of which are zeroth-order / meta-heuristic methods that do not effectively
utilize higher-order information from objectives and cannot scale to large-scale models with
thousands / millions of parameters. In light of the above gap, this paper introduces
LibMOON, the first multiobjective optimization library that supports state-of-the-art
gradient-based methods, provides a fair benchmark, and is open-sourced for the community.
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
H. Wang, W. Xiong, T. Xie, H. Zhao, T. Zhang
In Proceedings of the Association for Computational Linguistics: EMNLP 2024 (EMNLP 2024
Findings)
[abs]
[pdf] [code]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for
aligning large language models (LLMs) with human preferences. The RLHF process typically
starts by training a reward model (RM) using human preference data. Conventional RMs are
trained on pairwise responses to the same user request, with relative ratings indicating
which response humans prefer. The trained RM serves as a proxy for human preferences.
However, due to the black-box nature of RMs, their outputs lack interpretability, as humans
cannot intuitively understand why an RM thinks a response is good or not. As RMs act as
human preference proxies, we believe they should be human-interpretable to ensure that their
internal decision processes are consistent with human preferences and to prevent reward
hacking in LLM alignment. To build RMs with interpretable preferences, we propose a
two-stage approach: i) train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) with
multi-dimensional absolute-rating data, each dimension corresponding to a
human-interpretable objective (e.g., honesty, verbosity, safety); ii) employ a
Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most
suitable reward objectives based on the context. We efficiently trained an ArmoRM with
Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM. Our
trained model, ArmoRM-Llama3-8B, obtains state-of-the-art performance on RewardBench, a
benchmark evaluating RMs for language modeling. Notably, the performance of our model
surpasses the LLM-as-a-judge method with GPT-4 judges by a margin, and approaches the
performance of the much larger Nemotron-4 340B reward model.
Semi-Supervised Reward Modeling via Iterative Self-Training
Y. He, H. Wang, Z. Jiang, A. Papangelis, H. Zhao
In Proceedings of the Association for Computational Linguistics: EMNLP 2024 (EMNLP 2024
Findings)
[abs]
[pdf] [code]
Reward models (RM) capture the values and preferences of humans and play a central role in
Reinforcement Learning with Human Feedback (RLHF) to align pretrained large language models
(LLMs). Traditionally, training these models relies on extensive human-annotated preference
data, which poses significant challenges in terms of scalability and cost. To overcome these
limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM
training using unlabeled data. Given an unlabeled dataset, SSRM involves three key iterative
steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a
confidence threshold, and supervised finetuning on the refined dataset. Across extensive
experiments on various model configurations, we demonstrate that SSRM significantly improves
reward models without incurring additional labeling costs. Notably, SSRM can achieve
performance comparable to models trained entirely on labeled data of equivalent volumes.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data,
thereby decreasing the overall cost and time involved in training effective reward models.
Mitigating the Alignment Tax of RLHF
Y. Lin, H. Lin, W. Xiong, S. Diao, J. Liu, J. Zhang, R. Pan, H. Wang, W. Hu, H. Zhang, H. Dong,
R. Pi, H. Zhao, N. Jiang, H. Ji, Y. Yao, T. Zhang
In Proceedings of the Association for Computational Linguistics: EMNLP 2024 (EMNLP 2024)
[abs]
[pdf] [code]
LLMs acquire a wide range of abilities during pre-training, but aligning LLMs under
Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting pretrained
abilities, which is also known as the alignment tax. To investigate alignment tax, we
conducted experiments with existing RLHF algorithms using OpenLLaMA-3B, which revealed a
pronounced alignment tax in NLP tasks. Whereas, despite various techniques to mitigate
forgetting, they are often at odds with the RLHF performance, leading to a trade-off between
alignment performance and forgetting mitigation, leading to an alignment-forgetting
trade-off.
In this paper we show that model averaging, which simply interpolates between pre and post
RLHF model weights, surprisingly achieves the most strongest alignment-forgetting Pareto
front among a wide range of competing methods. To understand its effectiveness, we offer
theoretical insights into model averaging, revealing that it enhances performance Pareto
front by increasing feature diversity on the layers where tasks share overlapped feature
spaces. Empirical evidence corroborates our analysis by showing the benefits of averaging
low-level transformer layers. Building on the analysis and the observation that averaging
different layers of the transformer leads to significantly different alignment-forgetting
trade-offs, we propose Heterogeneous Model Averaging (HMA) to Heterogeneously find various
combination ratios of model layers. HMA seeks to maximize the alignment performance while
incurring minimal alignment tax. Moreover, we validate HMA's performance across a range of
RLHF algorithms over OpenLLaMA-3B and further extend our findings to Mistral-7B which is
evaluated by open-sourced preference model and GPT4.
A Unified Post-Processing Framework for Group Fairness in Classification
R. Xian, H. Zhao
arXiv preprint
[abs]
[pdf] [code]
We present a post-processing algorithm for fair classification that covers group fairness
criteria including statistical parity, equal opportunity, and equalized odds under a single
framework, and is applicable to multiclass problems in both attribute-aware and
attribute-blind settings. Our algorithm, called "LinearPost", achieves fairness post-hoc by
linearly transforming the predictions of the (unfair) base predictor with a "fairness risk"
according to a weighted combination of the (predicted) group memberships. It yields the
Bayes optimal fair classifier if the base predictors being post-processed are Bayes optimal,
otherwise, the resulting classifier may not be optimal, but fairness is guaranteed as long
as the group membership predictor is multicalibrated. The parameters of the post-processing
can be efficiently computed and estimated from solving an empirical linear program.
Empirical evaluations demonstrate the advantage of our algorithm in the high fairness regime
compared to existing post-processing and in-processing fair classification algorithms.
Fair and Optimal Prediction via Post-Processing
H. Zhao
AI Magazine (an overview of our group's work on algorithmic fairness and more broadly,
trustworthy machine learning)
[abs]
[link]
With the development of machine learning algorithms and the increasing computational
resources available, artificial intelligence has achieved great success in many application
domains. However, the success of machine learning has also raised concerns about the
fairness of the learned models. For instance, the learned models can perpetuate and even
exacerbate the potential bias and discrimination in the training data. This issue has become
a major obstacle to the deployment of machine learning systems in high-stakes domains, for
example, criminal judgment, medical testing, online advertising, hiring process, and so
forth. To mitigate the potential bias exhibited by machine learning models, fairness
criteria can be integrated into the training process to ensure fair treatment across all
demographics, but it often comes at the expense of model performance. Understanding such
tradeoffs, therefore, is crucial to the design of optimal and fair algorithms. My research
focuses on characterizing the inherent tradeoff between fairness and accuracy in machine
learning, and developing algorithms that can achieve both fairness and optimality. In this
article, I will discuss our recent work on designing post-processing algorithms for fair
classification, which can be applied to a wide range of fairness criteria, including
statistical parity, equal opportunity, and equalized odds, under both attribute-aware and
attribute-blind settings, and is particularly suited to large-scale foundation models where
retraining is expensive or even infeasible. I will also discuss the connections between our
work and other related research on trustworthy machine learning, including the connections
between algorithmic fairness and differential privacy as well as adversarial robustness.
An Empirical Study of Self-Supervised Learning with Wasserstein Distance
M. Yamada, Y. Takezawa, G. Houry, K. Düsterwald, D. Sulem, H. Zhao, Y. H. Tsai
Entropy (Entropy 2024)
[abs]
[arXiv] [Entropy]
In this study, we delve into the problem of self-supervised learning (SSL) utilizing the
1-Wasserstein distance on a tree structure (a.k.a., Tree-Wasserstein distance (TWD)), where
TWD is defined as the L1 distance between two tree-embedded vectors. In SSL methods, the
cosine similarity is often utilized as an objective function; however, it has not been well
studied when utilizing the Wasserstein distance. Training the Wasserstein distance is
numerically challenging. Thus, this study empirically investigates a strategy for optimizing
the SSL with the Wasserstein distance and finds a stable training procedure. More
specifically, we evaluate the combination of two types of TWD (total variation and
ClusterTree) and several probability models, including the softmax function, the ArcFace
probability model, and simplicial embedding. We propose a simple yet effective Jeffrey
divergence-based regularization method to stabilize optimization. Through empirical
experiments on STL10, CIFAR10, CIFAR100, and SVHN, we find that a simple combination of the
softmax function and TWD can obtain significantly lower results than the standard SimCLR.
Moreover, a simple combination of TWD and SimSiam fails to train the model. We find that the
model performance depends on the combination of TWD and probability model, and that the
Jeffrey divergence regularization helps in model training. Finally, we show that the
appropriate combination of the TWD and probability model outperforms cosine similarity-based
representation learning.
Gradual Domain Adaptation: Theory and Algorithms
Y. He, H. Wang, B. Li, H. Zhao
Journal of Machine Learning Research (JMLR 2024)
(Extended version of our ICML 2022 paper under title "Understanding Gradual Domain Adaptation:
Improved Analysis, Optimal Path and Beyond")
[abs]
[arXiv] [JMLR] [code]
Unsupervised domain adaptation (UDA) adapts a model from a labeled source domain to an
unlabeled target domain in a one-off way. Though widely applied, UDA faces a great challenge
whenever the distribution shift between the source and the target is large. Gradual domain
adaptation (GDA) mitigates this limitation by using intermediate domains to gradually adapt
from the source to the target domain. In this work, we first theoretically analyze gradual
self-training, a popular GDA algorithm, and provide a significantly improved generalization
bound compared with Kumar et al. (2020). Our theoretical analysis leads to an interesting
insight: to minimize the generalization error on the target domain, the sequence of
intermediate domains should be placed uniformly along the Wasserstein geodesic between the
source and target domains. The insight is particularly useful under the situation where
intermediate domains are missing or scarce, which is often the case in real-world
applications. Based on the insight, we propose Generative Gradual DOmain Adaptation with
Optimal Transport (GOAT), an algorithmic framework that can generate intermediate domains in
a data-dependent way. More concretely, we first generate intermediate domains along the
Wasserstein geodesic between two given consecutive domains in a feature space, then apply
gradual self-training to adapt the source-trained classifier to the target along the
sequence of intermediate domains. Empirically, we demonstrate that our GOAT framework can
improve the performance of standard GDA when the given intermediate domains are scarce,
significantly broadening the real-world application scenarios of GDA. Our code is available
at https://github.com/yifei-he/GOAT.
Efficient Modality Selection in Multimodal Learning
Y. He, R. Cheng, G. Balasubramaniam, Y. H. Tsai, and H. Zhao
Journal of Machine Learning Research (JMLR 2024)
[abs]
[pdf]
Multimodal learning aims to learn from data of different modalities by fusing information
from heterogeneous sources. Although it is beneficial to learn from more modalities, it is
often infeasible to use all available modalities under limited computational resources.
Modeling with all available modalities can also be inefficient and unnecessary when
information across input modalities overlaps. In this paper, we study the modality selection
problem, which aims to select the most useful subset of modalities for learning under a
cardinality constraint. To that end, we propose a unified theoretical framework to quantify
the learning utility of modalities, and we identify dependence assumptions to flexibly model
the heterogeneous nature of multimodal data, which also allows efficient algorithm design.
Accordingly, we derive a greedy modality selection algorithm via submodular maximization,
which selects the most useful modalities with an optimality guarantee on learning
performance. We also connect marginal-contribution-based feature importance scores, such as
Shapley value, from the feature selection domain to the context of modality selection, to
efficiently compute the importance of individual modality. We demonstrate the efficacy of
our theoretical results and modality selection algorithms on 2 synthetic and 4 real-world
data sets on a diverse range of multimodal data.
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment
with Multi-Objective Rewards
H. Wang, Y. Lin, W. Xiong, R. Yang, S. Diao, S. Qiu, H. Zhao, T. Zhang
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics
(ACL 2024)
[abs]
[arXiv] [code] [poster]
Fine-grained control over large language models (LLMs) remains a significant challenge,
hindering their adaptability to diverse user needs. While Reinforcement Learning from Human
Feedback (RLHF) shows promise in aligning LLMs, its reliance on scalar rewards often limits
its ability to capture diverse user preferences in real-world applications. To address this
limitation, we introduce the Directional Preference Alignment (DPA) framework. Unlike the
scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse
preference profiles. Additionally, DPA models user preferences as directions (i.e., unit
vectors) in the reward space to achieve user-dependent preference control. Our method
involves training a multi-objective reward model and then fine-tuning the LLM with a
preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF method
adopted by Llama 2. This method enjoys a better performance trade-off across various reward
objectives. In comparison with the scalar-reward RLHF, DPA offers users intuitive control
over LLM generation: they can arithmetically specify their desired trade-offs (e.g., more
helpfulness with less verbosity). We also validate the effectiveness of DPA with real-world
alignment experiments on Mistral-7B. Our method provides straightforward arithmetic control
over the trade-off between helpfulness and verbosity while maintaining competitive
performance with strong baselines such as Direct Preference Optimization (DPO).
Differentially Private Post-Processing for Fair Regression
R. Xian, Q. Li, G. Kamath, H. Zhao
In Proceedings of the 41st International Conference on Machine Learning (ICML 2024)
[abs]
[pdf] [code]
This paper describes a differentially private post-processing algorithm for learning fair
regressors satisfying statistical parity, addressing privacy concerns of machine learning
models trained on sensitive data, as well as fairness concerns of their potential to
propagate historical biases. Our algorithm can be applied to post-process any given
regressor to improve fairness by remapping its outputs. It consists of three steps: first,
the output distributions are estimated privately via histogram density estimation and the
Laplace mechanism, then their Wasserstein barycenter is computed, and the optimal transports
to the barycenter are used for post-processing to satisfy fairness. We analyze the sample
complexity of our algorithm and provide fairness guarantee, revealing a trade-off between
the statistical bias and variance induced from the choice of the number of bins in the
histogram, in which using less bins always favors fairness at the expense of error.
Pairwise Alignment Improves Graph Domain Adaptation
S. Liu, D. Zou, H. Zhao, P. Li
In Proceedings of the 41st International Conference on Machine Learning (ICML 2024,
spotlight)
[abs]
[pdf]
Graph-based methods, pivotal for label inference over interconnected objects in many
real-world applications, often encounter generalization challenges, if the graph used for
model training differs significantly from the graph used for testing. This work delves into
Graph Domain Adaptation (GDA) to address the unique complexities of distribution shifts over
graph data, where interconnected data points experience shifts in features, labels, and in
particular, connecting patterns. We propose a novel, theoretically principled method,
Pairwise Alignment (Pair-Align) to counter graph structure shift by mitigating conditional
structure shift (CSS) and label shift (LS). Pair-Align uses edge weights to recalibrate the
influence among neighboring nodes to handle CSS and adjusts the classification loss with
label weights to handle LS. Our method demonstrates superior performance in real-world
applications, including node classification with region shift in social networks, and the
pileup mitigation task in particle colliding experiments. For the first application, we also
curate the largest dataset by far for GDA studies. Our method shows strong performance in
synthetic and other existing benchmark datasets.
Robust Multi-Task Learning with Excess Risks
Y. He, S. Zhou, G. Zhang, H. Yun, Y. Xu, B. Zeng, T. Chilimbi, H. Zhao
In Proceedings of the 41st International Conference on Machine Learning (ICML 2024)
[abs]
[pdf]
Multi-task learning (MTL) considers learning a joint model for multiple tasks by optimizing
a convex combination of all task losses. To solve the optimization problem, existing methods
use an adaptive weight updating scheme, where task weights are dynamically adjusted based on
their respective losses to prioritize difficult tasks. However, these algorithms face a
great challenge whenever label noise is present, in which case excessive weights tend to be
assigned to noisy tasks that have relatively large Bayes optimal errors, thereby
overshadowing other tasks and causing performance to drop across the board. To overcome this
limitation, we propose Multi-Task Learning with Excess Risks (ExcessMTL), an excess
risk-based task balancing method that updates the task weights by their distances to
convergence instead. Intuitively, ExcessMTL assigns higher weights to worse-trained tasks
that are further from convergence. To estimate the excess risks, we develop an efficient and
accurate method with Taylor approximation. Theoretically, we show that our proposed
algorithm achieves convergence guarantees and Pareto stationarity. Empirically, we evaluate
our algorithm on various MTL benchmarks and demonstrate its superior performance over
existing methods in the presence of label noise.
A Survey of Recent Methods for Addressing AI Fairness and Bias in Biomedicine
Y. Yang, M. Lin, H. Zhao, Y. Peng, F. Huang, Z. Lu
Journal of Biomedical Informatics (JBI 2024)
[abs]
[pdf] [arXiv]
Artificial intelligence (AI) systems have the potential to revolutionize clinical practices,
including improving diagnostic accuracy and surgical decision-making, while also reducing
costs and manpower. However, it is important to recognize that these systems may perpetuate
social inequities or demonstrate biases, such as those based on race or gender. Such biases
can occur before, during, or after the development of AI models, making it critical to
understand and address potential biases to enable the accurate and reliable application of
AI models in clinical settings. To mitigate bias concerns during model development, we
surveyed recent publications on different debiasing methods in the fields of biomedical
natural language processing (NLP) or computer vision (CV). Then we discussed the methods
that have been applied in the biomedical domain to address bias. We performed our literature
search on PubMed, ACM digital library, and IEEE Xplore of relevant articles published
between January 2018 and December 2023 using multiple combinations of keywords. We then
filtered the result of 10,041 articles automatically with loose constraints, and manually
inspected the abstracts of the remaining 890 articles to identify the 55 articles included
in this review. Additional articles in the references are also included in this review. We
discuss each method and compare its strengths and weaknesses. Finally, we review other
potential methods from the general domain that could be applied to biomedicine to address
bias and improve fairness. The bias of AIs in biomedicine can originate from multiple
sources. Existing debiasing methods that focus on algorithms can be categorized into
distributional or algorithmic.
Towards Practical Non-Adversarial Distribution Alignment via Variational Bounds
Z. Gong, B. Usman, H. Zhao and D. I. Inouye
In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics
(AISTATS 2024)
[abs]
[pdf]
Distribution alignment can be used to learn invariant representations with applications in
fairness and robustness. Most prior works resort to adversarial alignment methods but the
resulting minimax problems are unstable and challenging to optimize. Non-adversarial
likelihood-based approaches either require model invertibility, impose constraints on the
latent prior, or lack a generic framework for alignment. To overcome these limitations, we
propose a non-adversarial VAE-based alignment method that can be applied to any model
pipeline. We develop a set of alignment upper bounds (including a noisy bound) that have
VAE-like objectives but with a different perspective. We carefully compare our method to
prior VAE-based alignment approaches both theoretically and empirically. Finally, we
demonstrate that our novel alignment losses can replace adversarial losses in standard
invariant representation learning pipelines without modifying the original architectures --
thereby significantly broadening the applicability of non-adversarial alignment methods.
Fast 1-Wasserstein distance approximations using greedy strategies
G. Houry, H. Bao, H. Zhao and M. Yamada
In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics
(AISTATS 2024)
[abs]
[pdf] [poster]
Among numerous linear approximation methods proposed for optimal transport (OT), tree-based
methods appear to be fairly reliable, notably for language processing applications. Inspired
by these tree methods, we introduce several greedy heuristics aiming to compute even faster
approximations of OT. We first explicitly establish the equivalence between greedy matching
and optimal transport for tree metrics, and then we show that tree greedy matching can be
reduced to greedy matching on a one-dimensional line. Next, we propose two new greedy-based
algorithms in one dimension: the $k$-Greedy and 1D-ICT algorithms. This novel approach
provides Wasserstein approximations with accuracy similar to the original tree methods on
text datasets while being faster in practice. Finally, these algorithms are applicable
beyond tree approximations: using sliced projections of the original data still provides
fairly good accuracy while eliminating the need for embedding the data in a fixed and rigid
tree structure. This property makes these approaches even more versatile than the original
tree OT methods.
FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods
X. Han, J. Chi, Y. Chen, Q. Wang, H. Zhao, N. Zou, X. Hu
In Proceedings of the 12th International Conference on Learning Representations (ICLR
2024)
[abs]
[pdf]
This paper introduces the Fair Fairness Benchmark (\textsf{FFB}), a benchmarking framework
for in-processing group fairness methods. Ensuring fairness in machine learning is critical
for ethical and legal compliance. However, there exist challenges in comparing and
developing of fairness methods due to inconsistencies in experimental settings, lack of
accessible algorithmic implementations, and limited extensibility of current fairness
packages and tools. To address these issues, we introduce an open-source, standardized
benchmark for evaluating in-processing group fairness methods and provide a comprehensive
analysis of state-of-the-art methods to ensure different notions of group fairness. This
work offers the following key contributions: the provision of flexible, extensible,
minimalistic, and research-oriented open-source code; the establishment of unified fairness
method benchmarking pipelines; and extensive benchmarking, which yields key insights from
45,079 experiments. We believe our work will significantly facilitate the growth and
development of the fairness research community.
RLHF Workflow: From Reward Modeling to Online RLHF
H. Dong, W. Xiong, B. Pang, H. Wang,
H. Zhao, Y. Zhou, N. Jiang, D. Sahoo,
C. Xiong, T. Zhang
Transactions on Machine Learning Research (TMLR 2024)
[abs]
[pdf] [code]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback
(RLHF) in this technical report, which is widely reported to outperform its offline
counterpart by a large margin in the recent large language model (LLM) literature. However,
existing open-source RLHF projects are still largely confined to the offline learning
setting. In this technical report, we aim to fill in this gap and provide a detailed recipe
that is easy to reproduce for online iterative RLHF. In particular, since online human
feedback is usually infeasible for open-source communities with limited resources, we start
by constructing preference models using a diverse set of open-source datasets and use the
constructed proxy preference model to approximate human feedback. Then, we discuss the
theoretical insights and algorithmic principles behind online iterative RLHF, followed by a
detailed practical implementation. Our trained LLM, \texttt{SFR-Iterative-DPO-LLaMA-3-8B-R},
achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2,
Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and
TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain
state-of-the-art performance with fully open-source datasets. Further, we have made our
models, curated datasets, and comprehensive step-by-step code guidebooks publicly available.
Please refer to \url{https://github.com/RLHFlow/RLHF-Reward-Modeling} and
\url{https://github.com/RLHFlow/Online-RLHF} for more detailed information.
Enhancing Compositional Generalization via Compositional Feature Alignment
H. Wang, H. Si, H. Shao, H. Zhao
Transactions on Machine Learning Research (TMLR 2024)
[abs]
[pdf]
Real-world applications of machine learning models often confront data distribution shifts,
wherein discrepancies exist between the training and test data distributions. In the common
multi-domain multi-class setup, as the number of classes and domains scales up, it becomes
infeasible to gather training data for every domain-class combination. This challenge
naturally leads the quest for models with Compositional Generalization (CG) ability, where
models can generalize to unseen domain-class combinations. To delve into the CG challenge,
we develop CG-Bench, a suite of CG benchmarks derived from existing real-world image
datasets, and observe that the prevalent pretraining-finetuning paradigm on foundational
models, such as CLIP and DINOv2, struggles with the challenge. To address this challenge, we
propose Compositional Feature Alignment (CFA), a simple two-stage finetuning technique that
i) learns two orthogonal linear heads on a pretrained encoder with respect to class and
domain labels, and ii) fine-tunes the encoder with the newly learned head frozen. We
theoretically and empirically justify that CFA encourages compositional feature learning of
pretrained models. We further conduct extensive experiments on CG-Bench for CLIP and DINOv2,
two powerful pretrained vision foundation models. Experiment results show that CFA
outperforms common finetuning techniques in compositional generalization, corroborating
CFA's efficacy in compositional feature learning.
A General-Purpose Multi-Modal OOD Detection Framework
V. Duong, Q. Wu, Z. Zhou, E. Zavesky, W-L. Hsu, H. Zhao, H. Shao
Transactions on Machine Learning Research (TMLR 2024)
[abs]
[pdf]
Out-of-distribution (OOD) detection seeks to identify test samples that deviate from the
training data, which is critical to ensuring the safety and reliability of machine learning
(ML) systems. While a plethora of methods have been developed to detect uni-modal OOD
samples, only a few have focused on multi-modal OOD detection. Current contrastive
learning-based methods primarily address multi-modal OOD detection in a scenario where an
image is not related to the class labels in training data. However, ML systems in the
real-world applications may encounter a broader spectrum of anomalies caused by different
factors like systematic errors in labeling, environmental changes, and sensor malfunctions.
Hence, we propose a new method to be able to simultaneously detect anomalies from multiple
different OOD scenarios, arising from fine-grained image features and textual descriptions,
instead of large categorical information. To achieve this goal, we propose a general-purpose
weakly-supervised OOD detection framework, called WOOD, that combines a binary classifier
and a contrastive learning module to reap the benefits of both. In order to better
distinguish in-distribution (ID) samples from OOD ones, we employ the Hinge loss to
constrain the similarity of their latent representations. Moreover, we devise a new scoring
metric that fuses predictions from both the binary classifier and contrastive learning to
enhance OOD detection. Extensive experimental results on multiple benchmarks demonstrate
that the proposed WOOD significantly outperforms the state-of-the-art methods for
multi-modal OOD detection. Importantly, our approach can achieve superior detection
performance in a variety of OOD scenarios.
Personalized Federated Learning with Spurious Features: An Adversarial Approach
X. Wang, H. Zhao, K. Nahrstedt, S. Koyejo
Transactions on Machine Learning Research (TMLR 2024)
[abs]
[pdf]
One of the common approaches for personalizing federated learning is fine-tuning the global
model for each local client. While this addresses some issues of statistical heterogeneity,
we find that such personalization methods are vulnerable to spurious features at local
agents, leading to reduced generalization performance. This work considers a setup where
spurious features correlate with the label in each client's training environment, and the
mixture of multiple training environments (i.e., the global environment) diminishes the
spurious correlations. In other words, while the global federated learning model trained
over the global environment suffers less from spurious features, the local fine-tuning step
may lead to personalized models vulnerable to spurious correlations. In light of this
practical and pressing challenge, we propose a novel strategy to mitigate the effect of
spurious features during personalization by maintaining the adversarial transferability
between the global and personalized models. Empirical results on object and action
recognition tasks show that our proposed approach bounds personalized models from further
exploiting spurious features while preserving the benefit of enhanced accuracy from
fine-tuning.
Revisiting Scalarization in Multi-Task Learning: A Theoretical Perspective
Y. Hu, R. Xian, Q. Wu, Q. Fan, L. Yin, and H. Zhao
In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS
2023)
[abs]
[pdf] [poster] [slides] [video]
Linear scalarization, i.e., combining all loss functions by a weighted sum, has been the
default choice in the literature of multi-task learning (MTL) since its inception. In recent
years, there is a surge of interest in developing Specialized Multi-Task Optimizers (SMTOs)
that treat MTL as a multi-objective optimization problem. However, it remains open whether
there is a fundamental advantage of SMTOs over scalarization. In fact, heated debates exist
in the community comparing these two types of algorithms, mostly from an empirical
perspective. To approach the above question, in this paper, we revisit scalarization from a
theoretical perspective. We focus on linear MTL models and study whether scalarization is
capable of fully exploring the Pareto front. Our findings reveal that, in contrast to recent
works that claimed empirical advantages of scalarization, scalarization is inherently
incapable of full exploration, especially for those Pareto optimal solutions that strike the
balanced trade-offs between multiple tasks. More concretely, when the model is
under-parametrized, we reveal a multi-surface structure of the feasible region and identify
necessary and sufficient conditions for full exploration. This leads to the conclusion that
scalarization is in general incapable of tracing out the Pareto front. Our theoretical
results partially answer the open questions in Xin et al. (2021), and provide a more
intuitive explanation on why scalarization fails beyond non-convexity. We additionally
perform experiments on a real-world dataset using both scalarization and state-of-the-art
SMTOs. The experimental results not only corroborate our theoretical findings, but also
unveil the potential of SMTOs in finding balanced solutions, which cannot be achieved by
scalarization.
Efficient Learning of Linear Graph Neural Networks via Node Subsampling
S. Shin, I. Shomorony, and H. Zhao
In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS
2023)
[abs]
[pdf]
[poster]
Graph Neural Networks (GNNs) are a powerful class of machine learning models with
applications in recommender systems, drug discovery, social network analysis, and computer
vision. One challenge with their implementation is that GNNs often take large-scale graphs
as inputs, which imposes significant computational/storage costs in the training and testing
phases. In particular, the message passing operations of a GNN require multiplication of the
graph adjacency matrix A ∈\R^n \times n and the data matrix X ∈\R^n \times d, and the O(n^2
d) time complexity can be prohibitive for large n. Thus, a natural question is whether it is
possible to perform the GNN operations in (quasi-)linear time by avoiding the full
computation of A X. To study this question, we consider the setting of a regression task on
a two-layer Linear Graph Convolutional Network (GCN). We develop an efficient training
algorithm based on (1) performing node subsampling, (2) estimating the leverage scores of A
X based on the subsampled graph, and (3) performing leverage score sampling on A X. We show
that our proposed scheme learns the regression model observing only O(nd\eps^-2\log n)
entries of A in time O(nd^2 \eps^-2\log n), with the guarantee that the learned weights
deviate by at most εunder the \ell_2 norm from the model learned using the entire adjacency
matrix A. We present empirical results for regression problems on two real-world graphs and
show that our algorithm significantly outperforms other baseline sampling strategies that
exploit the same number of observations.
Learning List-Level Domain-Invariant Representations for Ranking
R. Xian, H. Zhuang, Z. Qin, H. Zamani, J. Lu, J. Ma, K. Hui, H. Zhao, X. Wang, M.
Bendersky
In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS 2023,
spotlight)
[abs]
[pdf] [poster]
Domain adaptation aims to transfer the knowledge acquired by models trained on (data-rich)
source domains to (low-resource) target domains, for which a popular method is invariant
representation learning. While they have been studied extensively for classification and
regression problems, how they apply to ranking problems, where the data and metrics have a
list structure, is not well understood. Theoretically, we establish a domain adaptation
generalization bound for ranking under listwise metrics such as MRR and NDCG. The bound
suggests an adaptation method via learning list-level domain-invariant feature
representations, whose benefits are empirically demonstrated by unsupervised domain
adaptation experiments on real-world ranking tasks, including passage reranking. A key
message is that for domain adaptation, the representations should be analyzed at the same
level at which the metric is computed, as we show that learning invariant representations at
the list level is most effective for adaptation on ranking problems.
Adaptation Augmented Model-based Policy Optimization
J. Shen, H. Lai, M. Liu, H. Zhao, Y. Yu, and W. Zhang
Journal of Machine Learning Research (JMLR 2023)
(Extended version of our NeurIPS 2020 paper under title "Model-based Policy Optimization with
Unsupervised Model Adaptation")
[abs]
[pdf]
Compared to model-free reinforcement learning (RL), model-based RL is often more sample
efficient by leveraging a learned dynamics model to help decision-making. However, the
learned model is usually not perfectly accurate and the error will compound in multi-step
predictions, which can lead to poor asymptotic performance. In this paper, we first derive
an upper bound of the return discrepancy between the real dynamics and the learned model,
which reveals the fundamental problem of distribution shift between simulated data and real
data.
Inspired by the theoretical analysis, we propose an adaptation augmented model-based policy
optimization (AMPO) framework to address the distribution shift problem from the
perspectives of feature learning and instance re-weighting, respectively. Specifically, the
feature-based variant, namely FAMPO, introduces unsupervised model adaptation to minimize
the integral probability metric (IPM) between feature distributions from real and simulated
data, while the instance-based variant, termed as IAMPO, utilizes importance sampling to
re-weight the real samples used to train the model.
Besides model learning, we also investigate how to improve policy optimization in the model
usage phase by selecting simulated samples with different probabilities according to their
uncertainty.
Extensive experiments on challenging continuous control tasks show that FAMPO and IAMPO,
coupled with our model usage technique, achieve superior performance against baselines,
which demonstrates the effectiveness of the proposed methods.
Train Your Own GNN Teacher: Graph-Aware Distillation on Textual Graphs
C. Mavromatis, V. N. Ioannidis, S. Wang, D. Zheng, S. Adeshina, J. Ma, H. Zhao, C.
Faloutsos, G. Karypis
In Proceedings of the European Conference on Machine Learning and Principles and Practice of
Knowledge Discovery in Databases (ECMLPKDD 2023)
[abs]
[pdf]
How can we learn effective node representations on textual graphs? Graph Neural Networks
(GNNs) that use Language Models (LMs) to encode textual information of graphs achieve
state-of-the-art performance in many node classification tasks. Yet, combining GNNs with LMs
has not been widely explored for practical deployments due to its scalability issues. In
this work, we tackle this challenge by developing a Graph-Aware Distillation framework
(GRAD) to encode graph structures into an LM for graph-free, fast inference. Different from
conventional knowledge distillation, GRAD jointly optimizes a GNN teacher and a graph-free
student over the graph's nodes via a shared LM. This encourages the graph-free student to
exploit graph information encoded by the GNN teacher while at the same time, enables the GNN
teacher to better leverage textual information from unlabeled nodes. As a result, the
teacher and the student models learn from each other to improve their overall performance.
Experiments in eight node classification benchmarks in both transductive and inductive
settings showcase GRAD's superiority over existing distillation approaches for textual
graphs.
Fair and Optimal Classification via Post-Processing
R. Xian, L. Yin, H. Zhao
In Proceedings of the 40th International Conference on Machine Learning (ICML
2023)
[abs]
[pdf] [code] [slides] [video] [AI Magazine]
Fairness in automated decision-making systems has gained increasing attention as their
applications expand to real-world high-stakes domains. To facilitate the design of fair ML
systems, it is essential to understand the potential trade-offs between fairness and
predictive power, and the construction of the optimal predictor under a given fairness
constraint. In this paper, for general classification problems under the group fairness
criterion of demographic parity (DP), we precisely characterize the trade-off between DP and
classification accuracy, referred to as the minimum cost of fairness. Our insight comes from
the key observation that finding the optimal fair classifier is equivalent to solving a
Wasserstein-barycenter problem under $\ell_1$-norm restricted to the vertices of the
probability simplex. Inspired by our characterization, we provide a construction of an
optimal fair classifier achieving this minimum cost via the composition of the Bayes
regressor and optimal transports from its output distributions to the barycenter. Our
construction naturally leads to an algorithm for post-processing any pre-trained predictor
to satisfy DP fairness, complemented with finite sample guarantees. Experiments on
real-world datasets verify and demonstrate the effectiveness of our approaches.
Understanding the Impact of Adversarial Robustness on Accuracy Disparity
Y. Hu, F. Wu, H. Zhang, and H. Zhao
In Proceedings of the 40th International Conference on Machine Learning (ICML
2023)
[abs]
[pdf]
While it has long been empirically observed that adversarial robustness may be at odds with
standard accuracy and may have further disparate impacts on different classes, it remains an
open question to what extent such observations hold and how the class imbalance plays a role
within. In this paper, we attempt to understand this question of accuracy disparity by
taking a closer look at linear classifiers under a Gaussian mixture model. We decompose the
impact of adversarial robustness into two parts: an inherent effect that will degrade the
standard accuracy on all classes, and the other caused by the class imbalance ratio, which
will increase the accuracy disparity compared to standard training. Furthermore, we also
extend our model to the general family of stable distributions. We demonstrate that while
the constraint of adversarial robustness consistently degrades the standard accuracy in the
balanced class setting, the class imbalance ratio plays a fundamentally different role in
accuracy disparity compared to the Gaussian case, due to the heavy tail of the stable
distribution. We additionally perform experiments on both synthetic and real-world datasets.
The empirical results not only corroborate our theoretical findings, but also suggest that
the implications may extend to nonlinear models over real-world datasets.
Structural Re-weighting Improves Graph Domain Adaptation
S. Liu, T. Li, Y. Feng, N. Tran, H. Zhao, Q. Qiu, and Pan Li
In Proceedings of the 40th International Conference on Machine Learning (ICML
2023)
[abs]
[pdf]
In many real-world applications, graph-structured data used for training and testing have
differences in distribution, such as in high energy physics (HEP) where simulation data used
for training may not match real experiments. Graph domain adaptation (GDA) is a method used
to address these differences. However, current GDA primarily works by aligning the
distributions of node representations output by a single graph neural network encoder shared
across the two domains, which may often yield sub-optimal solutions. This work examines
different impacts of distribution shifts caused by either graph structure or node attributes
and identifies a new type of shift, named conditional structure shift, which current GDA
approaches are provably sub-optimal to deal with. A novel approach, called structural
reweighting (StruRW), is proposed to address this issue and is tested on synthetic graphs,
four benchmark datasets, and a new application in HEP. StruRW has shown significant
performance improvement over the baselines in the settings with large graph structure
shifts, and reasonable performance improvement when node attribute shifts dominate.
Understanding and Constructing Latent Modality Structures in Multi-modal Representation
Learning
Q. Jiang, C. Chen, H. Zhao, L. Chen, Q. Ping, S. Dinh Tran, Y. Xu, B. Zeng, T. Chilimbi
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR
2023)
[abs]
[pdf]
Contrastive loss has been increasingly used in learning representations from multiple
modalities. In the limit, the nature of the contrastive loss encourages modalities to
exactly match each other in the latent space. Yet it remains an open question how the
modality alignment affects the downstream task performance. In this paper, based on an
information-theoretic argument, we first prove that exact modality alignment is sub-optimal
in general for downstream prediction tasks. Hence we advocate that the key of better
performance lies in meaningful latent modality structures instead of perfect modality
alignment. To this end, we propose three general approaches to construct latent modality
structures. Specifically, we design 1) a deep feature separation loss for intra-modality
regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a
geometric consistency loss for both intra- and inter-modality regularization. Extensive
experiments are conducted on two popular multi-modal representation learning frameworks: the
CLIP-based two-tower model and the ALBEF-based fusion model. We test our model on a variety
of tasks including zero/few-shot image classification, image-text retrieval, visual question
answering, visual reasoning, and visual entailment. Our method achieves consistent
improvements over existing methods, demonstrating the effectiveness and generalizability of
our proposed approach on latent modality structure regularization.
Costs and Benefits of Fair Regression
H. Zhao
Transactions on Machine Learning Research (TMLR 2023)
[abs]
[pdf]
Real-world applications of machine learning tools in high-stakes domains are often regulated
to be fair, in the sense that the predicted target should satisfy some quantitative notion
of parity with respect to a protected attribute. However, the exact tradeoff between
fairness and accuracy with a real-valued target is not entirely clear. In this paper, we
characterize the inherent tradeoff between statistical parity and accuracy in the regression
setting by providing a lower bound on the error of any attribute-blind fair regressor. Our
lower bound is sharp, algorithm-independent, and admits a simple interpretation: when the
moments of the target differ between groups, any fair algorithm has to make an error on at
least one of the groups. We further extend this result to give a lower bound on the joint
error of any (approximately) fair algorithm, using the Wasserstein distance to measure the
quality of the approximation. With our novel lower bound, we also show that the price paid
by a fair regressor that does not take the protected attribute as input is less than that of
a fair regressor with explicit access to the protected attribute. On the upside, we
establish the first connection between individual fairness, accuracy parity, and the
Wasserstein distance by showing that if a regressor is individually fair, it also
approximately verifies the accuracy parity, where the gap is again given by the Wasserstein
distance between the two groups. Inspired by our theoretical results, we develop a practical
algorithm for fair regression through the lens of representation learning, and conduct
experiments on a real-world dataset to corroborate our findings.
Learning Structured Representations by Embedding Class Hierarchy
S. Zeng, R. des Combes, H. Zhao
In Proceedings of the 11th International Conference on Learning Representations (ICLR
2023)
[abs]
[pdf] [code]
Existing models for learning representations in supervised classification problems are
permutation invariant with respect to class labels. However, structured knowledge about the
classes, such as hierarchical label structures, widely exists in many real-world datasets,
e.g., the ImageNet and CIFAR benchmarks. How to learn representations that can preserve such
structures among the classes remains an open problem. To approach this problem, given a tree
of class hierarchy, we first define a tree metric between any pair of nodes in the tree to
be the length of the shortest path connecting them. We then provide a method to learn the
hierarchical relationship of class labels by approximately embedding the tree metric in the
Euclidean space of features. More concretely, during supervised training, we propose to use
the Cophenetic Correlation Coefficient (CPCC) as a regularizer for the cross-entropy loss to
correlate the tree metric of classes and the Euclidean distance in the class-conditioned
representations. Our proposed regularizer is computationally lightweight and easy to
implement. Empirically, we demonstrate that this approach can help to learn more
interpretable representations due to the preservation of the tree metric, and leads to
better generalization in-distribution as well as under sub-population shifts over multiple
datasets.
Adaptive Power Method: Eigenvector Estimation from Sampled Data
S. Shin, H. Zhao, I. Shomorony
In Proceedings of the 34th International Conference on Algorithmic Learning Theory (ALT
2023)
[abs]
[pdf]
Computing the dominant eigenvectors of a matrix $A$ has many applications, such as principal
component analysis, spectral embedding, and PageRank. However, in general, this task relies
on the complete knowledge of the matrix $A$, which can be too large to store or even
infeasible to observe in many applications, e.g., large-scale social networks. Thus, a
natural question is how to accurately estimate the eigenvectors of $A$ when only partial
observations can be made by sampling entries from $A$. To this end, we propose the Adaptive
Power Method (\apmnospace), a variant of the well-known power method. At each power
iteration, \apm adaptively selects a subset of the entries of $A$ to observe based on the
current estimate of the top eigenvector. We show that \apm can estimate the dominant
eigenvector(s) of $A$ with squared error at most $\epsilon$ by observing roughly
$O(n\eps^{-2} \log^2 (n/\eps))$ entries of an $n\times n$ matrix. We present empirical
results for the problem of eigenvector centrality computation on two real-world graphs and
show that \apm significantly outperforms a non-adaptive estimation algorithm using the same
number of observations. Furthermore, in the context of eigenvector centrality, \apm can also
adaptively allocate the observation budget to selectively refine the estimate of nodes with
high centrality scores in the graph.
FedMM: A Communication Efficient Solver for Federated Adversarial Domain Adaptation
Y. Shen, J. Du, H. Zhao, Z. Ji, C. Ma, M. Gao
In Proceedings of the 22nd International Conference on Autonomous Agents and
Multiagent Systems (AAMAS 2023)
[abs]
[pdf]
Federated adversary domain adaptation is a unique distributed minimax training task due to
the heterogeneous data among different local clients, where each client only sees a subset
of the data that merely belongs to either the source or target domain. Despite the extensive
research in distributed minimax optimization, existing communication efficient solvers that
exploit multiple steps of the local update are still not able to generate satisfactory
solutions for federated adversarial domain adaptation because of the gradient divergence
issue among clients. To tackle this problem, we propose a distributed minimax optimizer,
referred to as FedMM, by introducing dual variables to bridge the gradient gap among
clients. This algorithm is effective even in the extreme case where each client has
different label classes and some clients only have unlabeled data. We prove that FedMM
admits benign convergence to a stationary point under domain-shifted unlabeled data. On a
variety of benchmark datasets, extensive experiments show that FedMM consistently achieves
both better communication savings and significant accuracy improvements over existing
federated optimizers based on the stochastic gradient descent ascent (SGDA) algorithm. When
training from scratch, for example, it outperforms other SGDA based federated average
methods by around 20% in accuracy over the same communication rounds; and it consistently
outperforms when training from pre-trained models.
Invariant Feature Subspace Recovery for Multi-Class Classification
G. Balasubramaniam, H. Wang, H. Zhao
NeurIPS Distribution Shifts (DistShift) Workshop, 2022 (NeurIPS 2022)
[abs] [pdf]
Domain generalization aims to learn a model over multiple training environments to
generalize to unseen environments. Recently, \cite{wang2022provable} proposed
Invariant-feature Subspace Recovery (ISR), a domain generalization algorithm that uses the
means of class-conditional data distributions to provably identify the invariant-feature
subspace under a given causal model. However, due to the specific assumptions of the causal
model, the original ISR algorithm is conditioned on a single class only, without utilizing
information from the rest of the classes. In this work, we consider the setting of
multi-class classification under a more general causal model, and propose an extension of
the ISR algorithm, called ISR-Multiclass. This proposed algorithm can provably recover the
invariant-feature subspace with $\lceil d_{spu}/k \rceil + 1$ environments, where $d_{spu}$
is the number of spurious features and $k$ is the number of classes. Empirically, we first
examine ISR-Multiclass in a synthetic dataset, and demonstrate its superiority over the
original ISR in the multi-class setting. Furthermore, we conduct experiments in Multiclass
Coloured MNIST, a semi-synthetic dataset with strong spurious correlations, and show that
ISR-Multiclass can significantly improve the robustness of neural nets trained by various
methods (e.g., ERM and IRM) against spurious correlations.
Algorithms and Theory for Supervised Gradual Domain Adaptation
J. Dong, S. Zhou, B. Wang, H. Zhao
Transactions on Machine Learning Research (TMLR 2022)
[abs]
[pdf]
The phenomenon of data distribution evolving over time has been observed in a range of
applications, calling for the need for adaptive learning algorithms. We thus study the
problem of supervised gradual domain adaptation, where labeled data from shifting
distributions are available to the learner along the trajectory, and we aim to learn a
classifier on a target data distribution of interest. Under this setting, we provide the
first generalization upper bound on the learning error under mild assumptions. Our results
are algorithm agnostic, general for a range of loss functions, and only depend linearly on
the averaged learning error across the trajectory. This shows significant improvement
compared to the previous upper bound for unsupervised gradual domain adaptation, where the
learning error on the target domain depends exponentially on the initial error on the source
domain. Compared with the offline setting of learning from multiple domains, our results
also suggest the potential benefits of the temporal structure among different domains in
adapting to the target one. Empirically, our theoretical results imply that learning proper
representations across the domains will effectively mitigate learning errors. Motivated by
these theoretical insights, we propose a min-max learning objective to learn the
representation and classifier simultaneously. Experimental results on both semi-synthetic
and large-scale real datasets corroborate our findings and demonstrate the effectiveness of
our objectives.
Fundamental Limits and Tradeoffs in Invariant Representation Learning
H. Zhao*, C. Dan*, B. Aragam, T. Jaakkola, G. Gordon, and P. Ravikumar
Journal of Machine Learning Research (JMLR 2022)
(Also presented at NeurIPS 2023 Journal Track)
[abs]
[JMLR] [arXiv] [poster]
A wide range of machine learning applications such as privacy-preserving learning,
algorithmic fairness, and domain adaptation/generalization among others, involve learning
\emph{invariant representations} of the data that aim to achieve two competing goals: (a)
maximize information or accuracy with respect to a target response, and (b) maximize
invariance or independence with respect to a set of protected features (e.g.\ for fairness,
privacy, etc). Despite their wide applicability, theoretical understanding of the optimal
tradeoffs --- with respect to accuracy, and invariance --- achievable by invariant
representations is still severely lacking. In this paper, we provide an information
theoretic analysis of such tradeoffs under both classification and regression settings. More
precisely, we provide a geometric characterization of the accuracy and invariance achievable
by any representation of the data; we term this feasible region the information plane. We
provide an inner bound for this feasible region for the classification case, and an exact
characterization for the regression case, which allows us to either bound or exactly
characterize the Pareto optimal frontier between accuracy and invariance. Although our
contributions are mainly theoretical, a key practical application of our results is in
certifying the potential sub-optimality of any given representation learning algorithm for
either classification or regression tasks. Our results shed new light on the fundamental
interplay between accuracy and invariance, and may be useful in guiding the design of future
representation learning algorithms.
Conditional Supervised Contrastive Learning for Fair Text Classification
J. Chi, W. Shand, Y. Yu, K.-W. Chang, H. Zhao, and Y. Tian
In Proceedings of the 2022 Empirical Methods in Natural Language Processing (EMNLP 2022
Findings)
[abs]
[pdf]
Contrastive representation learning has gained much attention due to its superior
performance in learning representations from both image and sequential data. However, the
learned representations could potentially lead to performance disparities in downstream
tasks, such as increased silencing of underrepresented groups in toxicity comment
classification. In light of this challenge, in this work, we study learning fair
representations that satisfy a notion of fairness known as equalized odds for text
classification via contrastive learning. Specifically, we first theoretically analyze the
connections between learning representations with a fairness constraint and
\emph{conditional supervised contrastive objectives}, and then propose to use conditional
supervised contrastive objectives to learn fair representations for text classification. We
conduct experiments on two text datasets to demonstrate the effectiveness of our approaches
in balancing the trade-offs between task performance and bias mitigation among existing
baselines for text classification. Furthermore, we also show that the proposed methods are
stable in different hyperparameter settings.
Exploring Gradient-based Multi-directional Controls in GANs
Z. Chen, R. Jiang, B. Duke, H. Zhao, and P. Aarabi
In Proceedings of the 17th European Conference on Computer Vision (ECCV 2022, oral)
[abs]
[pdf]
Generative Adversarial Networks (GANs) have been widely applied in modeling diverse image
distributions. However, despite its impressive applications, the structure of the latent
space in GANs largely remains as a black-box, leaving its controllable generation an open
problem, especially when spurious correlations between different semantic attributes exist
in the image distributions. To address this problem, previous methods typically learn linear
directions or individual channels that control semantic attributes in the image space.
However, they often suffer from imperfect disentanglement, or are unable to obtain
multi-directional controls. In this work, in light of the above challenges, we propose a
novel approach that discovers nonlinear controls, which enables multi-directional
manipulation as well as effective disentanglement, based on gradient information in the
learned GAN latent space. More specifically, we first learn interpolation directions by
following the gradients from classification networks trained separately on the attributes,
and then navigate the latent space by exclusively controlling channels activated for the
target attribute in the learned directions. Empirically, with small training data, our
approach is able to gain fine-grained controls over a diverse set of bi-directional and
multi-directional attributes, and we showcase its ability to achieve disentanglement
significantly better than state-of-the-art methods both qualitatively and quantitatively.
Greedy Modality Selection via Approximate Submodular Maximization
R. Cheng, G. Balasubramaniam, Y. He, Y. H. Tsai, H. Zhao
In Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence (UAI
2022)
[abs] [pdf]
Multimodal learning considers learning from multi-modality data, aiming to fuse
heterogeneous sources of information. However, it is not always feasible to leverage all
available modalities due to memory constraints. Further, training on all the modalities may
be inefficient when redundant information exists within data, such as different subsets of
modalities providing similar performance. In light of these challenges, we study modality
selection, intending to efficiently select the most informative and complementary modalities
under certain computational constraints. We formulate a theoretical framework for optimizing
modality selection in multimodal learning and introduce a utility measure to quantify the
benefit of selecting a modality. For this optimization problem, we present efficient
algorithms when the utility measure exhibits monotonicity and approximate submodularity. We
also establish a novel correspondence between the utility measure and existing marginal
contribution scores based on the Shapely value. Last, we demonstrate the efficacy of our
algorithm on synthetic (Patch-MNIST) and real-world (PEMS-SF Traffic) datasets.
Understanding Gradual Domain Adaptation: Improved Analysis, Optimal Path and Beyond
H. Wang, B. Li, H. Zhao
In Proceedings of the 39th International Conference on Machine Learning (ICML
2022)
[abs] [pdf] [code]
The vast majority of existing algorithms for unsupervised domain adaptation (UDA) focus on
adapting from a labeled source domain to an unlabeled target domain directly in a one-off
way. Gradual domain adaptation (GDA), on the other hand, assumes a path of ($T\mathrm{-}1$)
unlabeled intermediate domains bridging the source and the target, and aims to provide
better generalization on the target domain by leveraging the intermediate ones. Under
certain assumptions, \citet{kumar2020understanding} proposed a simple algorithm,
\textit{gradual self-training}, along with a generalization bound in the order of
$e^{\mathcal O(T)}(\eps_0 \mathrm{+}\mathcal O\bigl(\sqrt {\frac{\log T}{n}}\bigr) \bigr)$
for the target domain error, where $\eps_0$ is the source domain error and $n$ is the data
size of each domain. Due to the exponential factor, this upper bound becomes vacuous when
$T$ is only moderately large. In this work, we analyze gradual self-training under more
general and relaxed assumptions, and prove a significantly improved generalization bound as
$\eps_0\mathrm{+}\widetilde{\mathcal O}\bigl(T\Delta \mathrm{+} \frac{T}{\sqrt{n}}
\mathrm{+} \frac{1}{\sqrt{nT}}\bigr)$, where $\Delta$ is the average distributional distance
between consecutive domains. Compared with the existing bound with an \emph{exponential}
dependency on $T$ as a \textit{multiplicative} factor, our bound only depends on $T$
\emph{linearly and additively}. Perhaps more interestingly, our result implies the existence
of an optimal choice of $T$ that minimizes the generalization error, and it also naturally
suggests an optimal way to construct the path of intermediate domains so as to minimize the
accumulative path length $T\Delta$ between the source and the target. To corroborate the
implications of our theory, we examine gradual self-training on multiple semi-synthetic and
real datasets, which confirms our findings. We believe our insights provide a path forward
towards the design of future GDA algorithms.
Provable Domain Generalization via Invariant-Feature Subspace Recovery
H. Wang, H. Si, B. Li, H. Zhao
In Proceedings of the 39th International Conference on Machine Learning (ICML
2022)
[abs] [pdf] [code]
Domain generalization asks for models trained on a set of training environments to perform
well on unseen test environments. Recently, a series of algorithms such as Invariant Risk
Minimization (IRM) has been proposed for domain generalization. However, Rosenfeld et al.
(2021) shows that in a simple linear data model, even if non-convexity issues are ignored,
IRM and its extensions cannot generalize to unseen environments with less than ds+1 training
environments, where ds is the dimension of the spurious-feature subspace. In this paper, we
propose to achieve domain generalization with Invariant-feature Subspace Recovery (ISR). Our
first algorithm, ISR-Mean, can identify the subspace spanned by invariant features from the
first-order moments of the class-conditional distributions, and achieve provable domain
generalization with ds+1 training environments under the data model of Rosenfeld et al.
(2021). Our second algorithm, ISR-Cov, further reduces the required number of training
environments to O(1) using the information of second-order moments. Notably, unlike IRM, our
algorithms bypass non-convexity issues and enjoy global convergence guarantees. Empirically,
our ISRs can obtain superior performance compared with IRM on synthetic benchmarks. In
addition, on three real-world image and text datasets, we show that ISR-Mean can be used as
a simple yet effective post-processing method to increase the worst-case accuracy of trained
models against spurious correlations and group shifts.
Rethinking Controllable Variational Autoencoders
H. Shao, Y. Yang, H. Lin, L. Lin, Y. Chen, Q. Yang, H. Zhao
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR
2022)
[abs] [pdf]
The Controllable Variational Autoencoder (ControlVAE) combines automatic control theory with
the basic VAE model to manipulate the KL-divergence for overcoming posterior collapse and
learning disentangled representations. It has shown success in a variety of applications,
such as image generation, disentangled representation learning, and language modeling.
However, when it comes to disentangled representation learning, ControlVAE does not delve
into the rationale behind it. The goal of this paper is to develop a deeper understanding of
ControlVAE in learning disentangled representations, including the choice of a desired
KL-divergence (i.e, set point), and its stability during training. We first fundamentally
explain its ability to disentangle latent variables from an information bottleneck
perspective. We show that KL-divergence is an upper bound of the variational information
bottleneck. By controlling the KL-divergence gradually from a small value to a target value,
ControlVAE can disentangle the latent factors one by one. Based on this finding, we propose
a new DynamicVAE that leverages a modified incremental PI (proportional-integral)
controller, a variant of the proportional-integral-derivative (PID) algorithm, and employs a
moving average as well as a hybrid annealing method to evolve the value of KL-divergence
smoothly in a tightly controlled fashion. In addition, we analytically derive a lower bound
of the set point for disentangling. We then theoretically prove the stability of the
proposed approach. Evaluation results on multiple benchmark datasets demonstrate that
DynamicVAE achieves a good trade-off between the disentanglement and reconstruction quality.
We also discover that it can separate disentangled representation learning and
reconstruction via manipulating the desired KL-divergence.
Cross-Lingual Transfer with Class-weighted Language-Invariant Representations
R. Xian, H. Ji, H. Zhao
In Proceedings of the 10th International Conference on Learning Representations (ICLR
2022)
[abs] [pdf] [code]
Recent advances in neural modeling have produced deep multilingual language models capable
of extracting cross-lingual knowledge from unparallel texts, evidenced by their decent
zero-shot transfer performance. While studies have attributed this success to
cross-lingually shared representations, quantitative analyses are sparse. Towards a better
understanding of the role of multilingual representations, in this work, we first make the
following observations through empirical analysis: (1) invariance of the feature
representations strongly correlates with transfer performance, and (2) distributional shift
in class priors between data in the source and target languages negatively affects
performance -- an issue that is largely overlooked in prior work. Based on our findings, we
propose an unsupervised cross-lingual learning method, called importance-weighted domain
alignment (IWDA), that performs representation alignment, prior shift estimation, and
correction. Experiment results demonstrate its superiority under large prior shifts. In
addition, our method delivers further performance gains when combined with existing
semi-supervised learning techniques.
Conditional Contrastive Learning with Kernel
Y. H. Tsai, T. Li, M. Q. Ma, H. Zhao, K. Zhang, L-P. Morency, R. Salakhutdinov
In Proceedings of the 10th International Conference on Learning Representations (ICLR
2022)
[abs] [pdf] [code]
Conditional contrastive learning frameworks consider the conditional sampling
procedure that constructs positive or negative data pairs conditioned on specific
variables. Fair contrastive learning constructs negative pairs, for example, from the
same gender (conditioning on sensitive information), which in turn reduces undesirable
information from the learned representations; weakly supervised contrastive
learning constructs positive pairs with similar annotative attributes (conditioning
on auxiliary information), which in turn are incorporated into the representations.
Although conditional contrastive learning enables many applications, the conditional
sampling procedure can be challenging if we cannot obtain sufficient data
pairs for some values of the conditioning variable. This paper presents Conditional
Contrastive Learning with Kernel (CCL-K) that converts existing conditional contrastive
objectives into alternative forms that mitigate the insufficient data problem.
Instead of sampling data according to the value of the conditioning variable, CCLK uses the
Kernel Conditional Embedding Operator that samples data from all
available data and assigns weights to each sampled data given the kernel similarity
between the values of the conditioning variable. We conduct experiments using
weakly supervised, fair, and hard negatives contrastive learning, showing CCL-K
outperforms state-of-the-art baselines.
Inherent Tradeoffs in Learning Fair Representations
H. Zhao and G. Gordon
Journal of Machine Learning Research (JMLR 2022)
(Extended version of an earlier paper with the same title appearing in NeurIPS 2019)
[abs]
[JMLR] [arXiv]
Real-world applications of machine learning tools in high-stakes domains are often regulated
to be fair, in the sense that the predicted target should satisfy some quantitative notion
of parity with respect to a protected attribute. However, the exact tradeoff between
fairness and accuracy is not entirely clear, even for the basic paradigm of classification
problems. In this paper, we characterize an inherent tradeoff between statistical parity and
accuracy in the classification setting by providing a lower bound on the sum of group-wise
errors of any fair classifiers. Our impossibility theorem could be interpreted as a certain
uncertainty principle in fairness: if the base rates differ among groups, then any fair
classifier satisfying statistical parity has to incur a large error on at least one of the
groups. We further extend this result to give a lower bound on the joint error of any
(approximately) fair classifiers, from the perspective of learning fair representations. To
show that our lower bound is tight, assuming oracle access to Bayes (potentially unfair)
classifiers, we also construct an algorithm that returns a randomized classifier which is
both optimal (in terms of accuracy) and fair. Interestingly, when the protected attribute
can take more than two values, an extension of this lower bound does not admit an analytic
solution. Nevertheless, in this case, we show that the lower bound can be efficiently
computed by solving a linear program, which we term as the TV-Barycenter problem, a
barycenter problem under the TV-distance.
On the upside, we prove that if the group-wise Bayes optimal classifiers are close, then
learning fair representations leads to an alternative notion of fairness, known as the
accuracy parity, which states that the error rates are close between groups. Finally, we
also conduct experiments on real-world datasets to confirm our theoretical findings.
Towards Return Parity in Markov Decision Processes
J. Chi, J. Shen, X. Dai, W. Zhang, Y. Tian, and H. Zhao
In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics
(AISTATS 2022)
[abs]
[pdf]
Algorithmic decisions made by machine learning models in high-stakes domains may have
lasting impacts over time. Unfortunately, naive applications of standard fairness criterion
in static settings over temporal domains may lead to delayed and adverse effects. To
understand the dynamics of performance disparity, we study a fairness problem in Markov
decision processes (MDPs). Specifically, we propose return parity, a fairness notion that
requires MDPs from different demographic groups that share the same state and action spaces
to achieve approximately the same expected time-discounted rewards.
We first provide a decomposition theorem for return disparity, which decomposes the return
disparity of any two MDPs into the distance between group-wise reward functions, the
discrepancy of group policies, and the discrepancy between state visitation distributions
induced by the group policies.
Motivated by our decomposition theorem, we propose algorithms to mitigate return disparity
via learning a shared group policy with state visitation distributional alignment using
integral probability metrics.
We conduct experiments to corroborate our results, showing that the proposed algorithm can
successfully close the disparity gap while maintaining the performance of policies on two
real-world recommender system benchmark datasets.
Online Continual Adaptation with Active Self-Training
S. Zhou, H. Zhao, S. Zhang, L. Wang, H. Chang, Z. Wang, and W. Zhu
In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics
(AISTATS 2022)
[abs]
[pdf]
Models trained with offline data often suffer from continual distribution shifts and
expensive labeling in changing environments. This calls for a new online learning paradigm
where the learner can continually adapt to changing environments with limited labels. In
this paper, we propose a new online setting -- Online Active Continual Adaptation, where the
learner aims to continually adapt to changing distributions using both unlabeled samples and
active queries of limited labels. To this end, we propose Online Self-Adaptive Mirror
Descent (OSAMD), which adopts an online teacher-student structure to enable online
self-training from unlabeled data, and a margin-based criterion that decides whether to
query the labels to track changing distributions. Theoretically, we show that, in the
separable case, OSAMD has an $O({T}^{1/2})$ dynamic regret bound under mild assumptions,
which is even tighter than the lower bound $\Omega(T^{2/3})$ of traditional online learning
with full labels. In the general case, we show a regret bound of $O({\alpha^*}^{1/3}
{T}^{2/3} + \alpha^* T)$, where $\alpha^*$ denotes the separability of domains and is
usually small. Our theoretical results show that OSAMD can fast adapt to changing
environments with active queries. Empirically, we demonstrate that OSAMD achieves favorable
regrets under changing environments with limited labels on both simulated and real-world
data, which corroborates our theoretical findings.
Invariant Information Bottleneck for Domain Generalization
B. Li, Y. Shen, Y. Wang, W. Zhu, C. J. Reed, J. Zhang, D. Li, K. Keutzer, and H. Zhao
In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI
2022)
[abs]
[pdf] [poster]
Invariant risk minimization (IRM) has recently emerged as a promising alternative for domain
generalization. Nevertheless, the loss function is difficult to optimize for nonlinear
classifiers and the original optimization objective could fail when pseudo-invariant
features and geometric skews exist. Inspired by IRM, in this paper we propose a novel
formulation for domain generalization, dubbed invariant information bottleneck (IIB). IIB
aims at minimizing invariant risks for nonlinear classifiers and simultaneously mitigating
the impact of pseudo-invariant features and geometric skews. Specifically, we first present
a novel formulation for invariant causal prediction via mutual information. Then we adopt
the variational formulation of the mutual information to develop a tractable loss function
for nonlinear classifiers. To overcome the failure modes of IRM, we propose to minimize the
mutual information between the inputs and the corresponding representations. IIB
significantly outperforms IRM on synthetic datasets, where the pseudo-invariant features and
geometric skews occur, showing the effectiveness of proposed formulation in overcoming
failure modes of IRM. Furthermore, experiments on DomainBed show that IIB outperforms $13$
baselines by $0.9\%$ on average across $7$ real datasets.
Quantifying and Improving Transferability in Domain Generalization
G. Zhang, H. Zhao, Y. Yu, and P. Poupart
In Proceedings of the 35th Advances in Neural Information Processing Systems (NeurIPS
2021)
[abs]
[pdf] [poster] [code]
Out-of-distribution generalization is one of the key challenges when transferring a model
from the lab to the real world. Existing efforts mostly focus on building invariant features
among source and target domains. Based on invariant features, a high-performing classifier
on source domains could hopefully behave equally well on a target domain. In other words,
the invariant features are \emph{transferable}. However, in practice, there are no perfectly
transferable features, and some algorithms seem to learn ''more transferable'' features than
others. How can we understand and quantify such \emph{transferability}? In this paper, we
formally define transferability that one can quantify and compute in domain generalization.
We point out the difference and connection with common discrepancy measures between domains,
such as total variation and Wasserstein distance. We then prove that our transferability can
be estimated with enough samples and give a new upper bound for the target error based on
our transferability. Empirically, we evaluate the transferability of the feature embeddings
learned by existing algorithms for domain generalization. Surprisingly, we find that many
algorithms are not quite learning transferable features, although few could still survive.
In light of this, we propose a new algorithm for learning transferable features and test it
over various benchmark datasets, including RotatedMNIST, PACS, Office-Home and WILDS-FMoW.
Experimental results show that the proposed algorithm achieves consistent improvement over
many state-of-the-art algorithms, corroborating our theoretical findings.
EventKE: Event-Enhanced Knowledge Graph Embedding
Z. Zhang, H. Wang, H. Zhao, H. Tong and H. Ji
In Proceedings of the 2021 Empirical Methods in Natural Language Processing
(EMNLP 2021 Findings)
[abs]
[pdf] [code]
Knowledge graph (KG) is a kind of efficient and informative representation of structured
knowledge. A typical KG consists of a collection of knowledge triples, where each triple
$(h, r, t)$ describes that the head entity $h$ and tail entity $t$ are connected through a
relation $r$.
Recently, extensive studies have been focusing on knowledge graph representation learning,
which aims to learn low-dimensional entity and relation embeddings that are informative and
scalable to use for many downstream applications, such as information retrieval~\cite{irkg},
recommendation systems~\cite{recommenderkg}, machine reading
comprehension~\cite{machinereadingkg}, and query-answering systems~\cite{qakg,qakg2}.
Typical KG embedding models, such as \cite{transe,conve,rotate}, usually learn the model
parameters by maximizing pre-defined score functions on ground-truth triples.
One major limitation of such methods is that each knowledge triple is modeled locally and
independently, without considering the global contextual information of KGs.
To solve this problem, another line of approaches~\cite{rgcn,compgcn} manages to model KGs
as heterogeneous networks, and design message passing among entities using graph neural
networks to better utilize global structural information.
Understanding and Mitigating Accuracy Disparity in Regression
J. Chi, Y. Tian, G. Gordon and H. Zhao
In Proceedings of the 38th International Conference on Machine Learning (ICML 2021)
[abs]
[pdf] [code]
With the widespread deployment of large-scale prediction systems in high-stakes domains,
e.g., face recognition, criminal justice, etc., disparity on prediction accuracy between
different demographic subgroups has called for fundamental understanding on the source of
such disparity and algorithmic intervention to mitigate it. In this paper, we study the
accuracy disparity problem in regression. To begin with, we first propose an error
decomposition theorem, which decomposes the accuracy disparity into the distance between
marginal label distributions and the distance between conditional representations, to help
explain why such accuracy disparity appears in practice. Motivated by this error
decomposition and the general idea of distribution alignment with statistical distances, we
then propose an algorithm to reduce this disparity, and analyze its game-theoretic optima of
the proposed objective functions. To corroborate our theoretical findings, we also conduct
experiments on five benchmark datasets. The experimental results suggest that our proposed
algorithms can effectively mitigate accuracy disparity while maintaining the predictive
power of the regression models.
Bridging Multi-Task Learning and Meta-Learning: Towards Efficient Training and Effective
Adaptation
H. Wang, H. Zhao, B. Li
In Proceedings of the 38th International Conference on Machine Learning (ICML 2021)
[abs]
[pdf] [code]
Multi-task learning (MTL) aims to improve the generalization of several related tasks by
learning them jointly. As a comparison, in addition to the joint training scheme, modern
meta-learning allows unseen tasks with limited labels during the test phase, in the hope of
fast adaptation over them. Despite the subtle difference between MTL and meta-learning in
the problem formulation, both learning paradigms share the same insight that the shared
structure between existing training tasks could lead to better generalization and
adaptation. In this paper, we take one important step further to understand the close
connection between these two learning paradigms, through both theoretical analysis and
empirical investigation. Theoretically, we first demonstrate that MTL shares the same
optimization formulation with a class of gradient-based meta-learning (GBML) algorithms. We
then prove that for over-parameterized neural networks with sufficient depth, the learned
predictive functions of MTL and GBML are close. In particular, this result implies that the
predictions given by these two models are similar over the same unseen task. Empirically, we
corroborate our theoretical findings by showing that, with proper implementation, MTL is
competitive against state-of-the-art GBML algorithms on a set of few-shot image
classification benchmarks. Since existing GBML algorithms often involve costly second-order
bi-level optimization, our first-order MTL method is an order of magnitude faster on
large-scale datasets such as mini-ImageNet. We believe this work could help bridge the gap
between these two learning paradigms, and provide a computationally efficient alternative to
GBML that also supports fast task adaptation.
Information Obfuscation of Graph Neural Networks
P. Liao, H. Zhao, K. Xu, T. Jaakkola, G. Gordon, S. Jegelka and R. Salakhutdinov
In Proceedings of the 38th International Conference on Machine Learning (ICML 2021)
[abs]
[pdf] [code]
While the advent of Graph Neural Networks (GNNs) has greatly improved node and graph
representation learning in many applications, the neighborhood aggregation scheme exposes
additional vulnerabilities to adversaries seeking to extract node-level information about
sensitive attributes. In this paper, we study the problem of protecting sensitive attributes
by information obfuscation when learning with graph structured data. We propose a framework
to locally filter out pre-determined sensitive attributes via adversarial training with the
total variation and the Wasserstein distance. Our method creates a strong defense against
inference attacks, while only suffering small loss in task performance. Theoretically, we
analyze the effectiveness of our framework against a worst-case adversary, and characterize
an inherent trade-off between maximizing predictive accuracy and minimizing information
leakage. Experiments across multiple datasets from recommender systems, knowledge graphs and
quantum chemistry demonstrate that the proposed approach provides a robust defense across
various graph structures and tasks, while producing competitive GNN encoders for downstream
tasks.
Learning Invariant Representations and Risks for Semi-supervised Domain Adaptation
B. Li, Y. Wang, S. Zhang, D. Li, T. Darrell, K. Keutzer and H. Zhao
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR
2021)
[abs]
[pdf] [code]
The success of supervised learning hinges on the assumption that the training and test data
come from the same underlying distribution, which is often not valid in practice due to
potential distribution shift. In light of this, most existing methods for unsupervised
domain adaptation focus on achieving domain-invariant representations and small source
domain error. However, recent works have shown that this is not sufficient to guarantee good
generalization on the target domain, and in fact, is provably detrimental under label
distribution shift. Furthermore, in many real-world applications it is often feasible to
obtain a small amount of labeled data from the target domain and use them to facilitate
model training with source data. Inspired by the above observations, in this paper we
propose the first method that aims to simultaneously learn invariant representations and
risks under the setting of semi-supervised domain adaptation (Semi-DA). First, we provide a
finite sample bound for both classification and regression problems under Semi-DA. The bound
suggests a principled way to obtain target generalization, i.e. by aligning both the
marginal and conditional distributions across domains in feature space. Motivated by this,
we then introduce the LIRR algorithm for jointly \textbf{L}earning \textbf{I}nvariant
\textbf{R}epresentations and \textbf{R}isks. Finally, extensive experiments are conducted on
both classification and regression tasks, which demonstrates LIRR consistently achieves
state-of-the-art performance and significant improvements compared with the methods that
only learn invariant representations or invariant risks.
Self-supervised Representation Learning with Relative Predictive Coding
Y. H. Tsai*, M. Q. Ma*, M. Yang, H. Zhao, L-P Morency, and R. Salakhutdinov
In Proceedings of the 9th International Conference on Learning Representations (ICLR
2021)
[abs] [pdf] [video]
This paper introduces Relative Predictive Coding (RPC), a new contrastive representation
learning objective that maintains a good balance among training stability, minibatch size
sensitivity, and downstream task performance. The key to the success of RPC is two-fold.
First, RPC introduces the relative parameters to regularize the objective for boundedness
and low variance. Second, RPC contains no logarithm and exponential score functions, which
are the main cause of training instability in prior contrastive objectives. We empirically
verify the effectiveness of RPC on benchmark vision and speech self-supervised learning
tasks. Lastly, we relate RPC with mutual information (MI) estimation, showing RPC can be
used to estimate MI with low variance.
On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections
P. Li, Y. Wang, H. Zhao, P. Hong, H. Liu
In Proceedings of the 9th International Conference on Learning Representations (ICLR
2021)
[abs] [pdf] [video]
Disparate impact has raised serious concerns in machine learning applications and its
societal impacts. In response to the need of mitigating discrimination, fairness has been
regarded as a crucial property in algorithmic design. In this work, we study the problem of
disparate impact on graph-structured data. Specifically, we focus on dyadic fairness, which
articulates a fairness concept that a predictive relationship between two instances should
be independent of the sensitive attributes. Based on this, we theoretically relate the graph
connections to dyadic fairness on link predictive scores in learning graph neural networks,
and reveal that regulating weights on existing edges in a graph contributes to dyadic
fairness conditionally. Subsequently, we propose our algorithm, \textbf{FairAdj}, to
empirically learn a fair adjacency matrix with proper graph structural constraints for fair
link prediction, and in the meanwhile preserve predictive accuracy as much as possible.
Empirical validation demonstrates that our method delivers effective dyadic fairness in
terms of various statistics, and at the same time enjoys a favorable fairness-utility
tradeoff.
Model-based Policy Optimization with Unsupervised Model Adaptation
J. Shen, H. Zhao, W. Zhang and Y. Yu
In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS 2020,
Spotlight)
[abs] [pdf] [code] [video]
Model-based reinforcement learning methods learn a dynamics model with real data sampled
from the environment and leverage it to generate simulated data to derive an agent. However,
due to the potential distribution mismatch between simulated data and real data, this could
lead to degraded performance. Despite much effort being devoted to reducing this
distribution mismatch, existing methods fail to solve it explicitly. In this paper, we
investigate how to bridge the gap between real and simulated data due to inaccurate model
estimation for better policy optimization. To begin with, we first derive a lower bound of
the expected return, which naturally inspires a bound maximization algorithm by aligning the
simulated and real data distributions. To this end, we propose a novel model-based
reinforcement learning framework AMPO, which introduces unsupervised model adaptation to
minimize the integral probability metric (IPM) between feature distributions from real and
simulated data. Instantiating our framework with Wasserstein-1 distance gives a practical
model-based approach. Empirically, our approach achieves state-of-the-art performance in
terms of sample efficiency on a range of continuous control benchmark tasks.
Neural Methods for Point-wise Dependency Estimation
Y. H. Tsai, H. Zhao, M. Yamada, L-P. Morency and R. Salakhutdinov
In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS 2020,
Spotlight)
[abs] [pdf] [video]
Since its inception, the neural estimation of mutual information (MI) has demonstrated the
empirical success of modeling expected dependency between high-dimensional random variables.
However, MI is an aggregate statistic and cannot be used to measure point-wise dependency
between different events. In this work, instead of estimating the expected dependency, we
focus on estimating point-wise dependency (PD), which quantitatively measures how likely two
outcomes co-occur. We show that we can naturally obtain PD when we are optimizing MI neural
variational bounds. However, optimizing these bounds is challenging due to its large
variance in practice. To address this issue, we develop two methods (free of optimizing MI
variational bounds): Probabilistic Classifier and Density-Ratio Fitting. We demonstrate the
effectiveness of our approaches in 1) MI estimation, 2) self-supervised representation
learning, and 3) cross-modal retrieval task.
Domain Adaptation with Conditional Distribution Matching and Generalized Label Shift
R. Combes*, H. Zhao*, Y.X. Wang and G. Gordon
In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS
2020)
[abs] [pdf] [code]
[video]
Adversarial learning has demonstrated good performance in the unsupervised domain adaptation
setting, by learning domain-invariant representations. However, recent work has shown
limitations of this approach when label distributions differ between the source and target
domains. In this paper, we propose a new assumption, \textit{generalized label shift}
($\glsa$), to improve robustness against mismatched label distributions. $\glsa$ states
that, conditioned on the label, there exists a representation of the input that is invariant
between the source and target domains. Under $\glsa$, we provide theoretical guarantees on
the transfer performance of any classifier. We also devise necessary and sufficient
conditions for $\glsa$ to hold, by using an estimation of the relative class weights between
domains and an appropriate reweighting of samples. Our weight estimation method could be
straightforwardly and generically applied in existing domain adaptation (DA) algorithms that
learn domain-invariant representations, with small computational overhead. In particular, we
modify three DA algorithms, JAN, DANN and CDAN and evaluate their performance on standard
and artificial DA tasks. Our algorithms outperform the base versions, with vast improvements
for large label distribution mismatches. Our code is available at
\url{https://tinyurl.com/y585xt6j}.
Trade-offs and Guarantees of Adversarial Representation Learning for Information
Obfuscation
H. Zhao*, J. Chi*, Y. Tian and G. Gordon
In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS
2020)
[abs]
[pdf] [video]
[slides] [poster]
Crowdsourced data used in machine learning services might carry sensitive information about
attributes that users do not want to share. Various methods have been proposed to minimize
the potential information leakage of sensitive attributes while maximizing the task
accuracy. However, little is known about the theory behind these methods. In light of this
gap, we develop a novel theoretical framework for attribute obfuscation. Under our
framework, we propose a minimax optimization formulation to protect the given attribute and
analyze its inference guarantees against worst-case adversaries. Meanwhile, there is a
tension between minimizing information leakage and maximizing task accuracy. To understand
this, we prove an information-theoretic lower bound to precisely characterize the
fundamental trade-off between accuracy and information leakage. We conduct experiments on
two real-world datasets to corroborate the inference guarantees and validate the inherent
trade-offs therein. Our results indicate that, among several alternatives, the adversarial
learning approach achieves the best trade-off in terms of attribute obfuscation and accuracy
maximization.
A Review of Single-Source Deep Unsupervised Visual Domain Adaptation
S. Zhao, X. Yue, S. Zhang, B. Li, H. Zhao, B. Wu, R. Krishna, J. Gonzalez, A.
Sangiovanni-Vincentelli, S. and K. Keutzer
IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
[abs] [pdf]
Large-scale labeled training datasets have enabled deep neural networks to excel across a
wide range of benchmark vision tasks. However, in many applications, it is prohibitively
expensive and time-consuming to obtain large quantities of labeled data. To cope with
limited labeled training data, many have attempted to directly apply models trained on a
large-scale labeled source domain to another sparsely labeled or unlabeled target domain.
Unfortunately, direct transfer across domains often performs poorly due to the presence of
domain shift or dataset bias. Domain adaptation is a machine learning paradigm that aims to
learn a model from a source domain that can perform well on a different (but related) target
domain. In this paper, we review the latest single-source deep unsupervised domain
adaptation methods focused on visual tasks and discuss new perspectives for future research.
We begin with the definitions of different domain adaptation strategies and the descriptions
of existing benchmark datasets. We then summarize and compare different categories of
single-source unsupervised domain adaptation methods, including discrepancy-based methods,
adversarial discriminative methods, adversarial generative methods, and
self-supervision-based methods. Finally, we discuss future research directions with
challenges and possible solutions.
On Learning Language-Invariant Representations for Universal Machine Translation
H. Zhao, J. Hu and A. Risteski
In Proceedings of the 37th International Conference on Machine Learning (ICML 2020)
[abs] [pdf] [video]
[slides] [blog]
The goal of universal machine translation is to learn to translate between any pair of
languages, given a corpus of paired translated documents for a small subset of all pairs of
languages. Despite impressive empirical results and an increasing interest in massively
multilingual models, theoretical analysis on translation errors made by such universal
machine translation models is only nascent. In this paper, we formally prove certain
impossibilities of this endeavour in general, as well as prove positive results in the
presence of additional (but natural) structure of data. For the former, we derive a lower
bound on the translation error in the many-to-one translation setting, which shows that any
algorithm aiming to learn shared sentence representations among multiple language pairs has
to make a large translation error on at least one of the translation tasks, if no assumption
on the structure of the languages is made. For the latter, we show that if the paired
documents in the corpus follow a natural encoder-decoder generative process, we can expect a
natural notion of ``generalization'': a linear number of language pairs, rather than
quadratic, suffices to learn a good representation. Our theory also explains what kinds of
connection graphs between pairs of languages are better suited: ones with longer paths
result in worse sample complexity in terms of the total number of documents per language
pair needed. We believe our theoretical insights and implications contribute to the future
algorithmic design of universal machine translation.
Deep Fair Clustering for Visual Learning
P. Li, H. Zhao and H. Liu
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR
2020)
[abs]
[pdf]
Fair clustering aims to hide sensitive attributes during data partition by balancing the
distribution of protected subgroups in each cluster. Existing work attempts to address this
problem by reducing it to a classical balanced clustering with a constraint on the
proportion of protected subgroups of the input space. However, the input space may limit the
clustering performance, and so far only low-dimensional datasets have been considered. In
light of these limitations, in this paper, we propose Deep Fair Clustering (DFC) to learn
fair and clustering-favorable representations for clustering simultaneously. Our approach
could effectively filter out sensitive attributes from representations, and also lead to
representations that are amenable for the following cluster analysis. Theoretically, we show
that our fairness constraint in DFC will not incur much loss in terms of several clustering
metrics. Empirically, we provide extensive experimental demonstrations on four visual
datasets to corroborate the superior performance of the proposed approach over existing fair
clustering and deep clustering methods on both cluster validity and fairness criterion.
DyCRS: Dynamic Interpretable Postoperative Complication Risk Scoring
W. Wang, H. Zhao, H. Zhuang, N. Shah and R. Padman
In Proceedings of The Web Conference 2020 (WWW 2020, Oral)
[abs]
[pdf]
Early identification of patients at risk for postoperative complications can facilitate
timely workups and treatments and improve health outcomes. Currently, a widely-used surgical
risk calculator online web system developed by the American College of Surgeons (ACS) uses
patients’ static features, e.g. gender, age, to assess the risk of postoperative
complications. However, the most crucial signals that reflect the actual postoperative
physical conditions of patients are usually real-time dynamic signals, including the vital
signs of patients (e.g., heart rate, blood pressure) collected from postoperative
monitoring. In this paper, we develop a dynamic postoperative complication risk scoring
framework (DyCRS) to detect the “at-risk” patients in a real-time way based on postoperative
sequential vital signs and static features. DyCRS is based on adaptations of the Hidden
Markov Model (HMM) that captures hidden states as well as observable states to generate a
real-time, probabilistic, complication risk score. Evaluating our model using electronic
health record (EHR) on elective Colectomy surgery from a major health system, we show that
DyCRS significantly outperforms the state-of-the-art ACS calculator and real-time predictors
with 50.16% area under precision-recall curve (AUCPRC) gain on average in terms of detection
effectiveness. In terms of earliness, our DyCRS can predict 15hrs55mins earlier on average
than clinician's diagnosis with the recall of 60% and precision of 55%. Furthermore, Our
DyCRS can extract interpretable patients' stages, which are consistent with previous medical
postoperative complication studies. We believe that our contributions demonstrate
significant promise for developing a more accurate, robust and interpretable postoperative
complication risk scoring system, which can benefit more than 50 million annual surgeries in
the US by substantially lowering adverse events and healthcare costs.
Conditional Learning of Fair Representations
H. Zhao, A. Coston, T. Adel and G. Gordon
In Proceedings of the 8th International Conference on Learning Representations
(ICLR 2020, Spotlight)
NeurIPS 2019 Workshop on Machine Learning with Guarantees (NeurIPS 2019)
[abs]
[pdf] [video] [slides] [code]
We propose a novel algorithm for learning fair representations that can
simultaneously mitigate two notions of disparity among different demographic
subgroups. Two key components underpinning the design of our algorithm are
balanced error rate and conditional alignment of representations. We show
how these two components contribute to ensuring accuracy parity and
equalized false-positive and false-negative rates across groups without
impacting demographic parity. Furthermore, we also demonstrate both in
theory and on two real-world experiments that the proposed algorithm leads
to a better utility-fairness trade-off on balanced datasets compared with
existing algorithms on learning fair representations.
Continual Learning with Adaptive Weights (CLAW)
T. Adel, H. Zhao and R. E. Turner
In Proceedings of the 8th International Conference on Learning Representations
(ICLR 2020)
[abs]
[pdf] [video]
Approaches to continual learning aim to successfully learn a set of related
tasks that arrive in an online manner. Recently, several frameworks have
been developed which enable deep learning to be deployed in this learning
scenario. A key modelling decision is to what extent the architecture should
be shared across tasks. On the one hand, separately modelling each task
avoids catastrophic forgetting but it does not support transfer learning and
leads to large models. On the other hand, rigidly specifying a shared
component and a task-specific part enables task transfer and limits the
model size, but it is vulnerable to catastrophic forgetting and restricts
the form of task-transfer that can occur. Ideally, the network should
adaptively identify which parts of the network to share in a data driven
way. Here we introduce such an approach called Continual Learning with
Adaptive Weights (CLAW), which is based on probabilistic modelling and
variational inference. Experiments show that CLAW achieves state-of-the-art
performance on six benchmarks in terms of overall continual learning
performance, as measured by classification accuracy, and in terms of
addressing catastrophic forgetting.
Inherent Tradeoffs in Learning Fair Representations
H. Zhao and G. Gordon
In Proceedings of the 33rd Advances in Neural Information Processing Systems
(NeurIPS 2019)
[abs]
[pdf] [poster] [slides] [blog]
With the prevalence of machine learning in high-stakes applications,
especially
the ones regulated by anti-discrimination laws or societal norms, it is
crucial
to ensure that the predictive models do not propagate any existing bias or
discrimination. Due to the ability of deep neural nets to learn rich
representations, recent advances in algorithmic fairness have focused on
learning fair representations with adversarial techniques to reduce bias in
data
while preserving utility simultaneously. In this paper, through the lens of
information theory, we provide the first result that quantitatively
characterizes the tradeoff between demographic parity and the joint utility
across different population groups. Specifically, when the base rates differ
between groups, we show that any method aiming to learn fair representations
admits an information-theoretic lower bound on the joint error across these
groups. To complement our negative results, we also prove that if the
optimal
decision functions across different groups are close, then learning fair
representations leads to an alternative notion of fairness, known as the
accuracy parity, which states that the error rates are close between groups.
Finally, our theoretical findings are also confirmed empirically on
real-world
datasets.
Adversarial Privacy Preservation under Attribute Inference Attack
H. Zhao*, J. Chi*, Y. Tian and G. Gordon
In NeurIPS 2019 Workshop on Machine Learning with Guarantees (NeurIPS 2019)
[abs]
[pdf]
With the prevalence of machine learning services, crowdsourced data containing sensitive
information poses substantial privacy challenges. Existing work focusing on protecting
against membership inference attacks under the rigorous framework of differential
privacy are vulnerable to attribute inference attacks. In light of the current gap
between theory and practice, we develop a novel theoretical framework for
privacy-preservation under the attack of attribute inference. Under our framework, we
propose a minimax optimization formulation to protect the given attribute and analyze
its privacy guarantees against arbitrary adversaries. On the other hand, it is clear
that privacy constraint may cripple utility when the protected attribute is correlated
with the target variable. To this end, we also prove an information-theoretic lower
bound to precisely characterize the fundamental trade-off between utility and privacy.
Empirically, we extensively conduct experiments to corroborate our privacy guarantee and
validate the inherent trade-offs in different privacy preservation algorithms. Our
experimental results indicate that the adversarial representation learning approaches
achieve the best trade-off in terms of privacy preservation and utility maximization.
On Learning Invariant Representations for Domain Adaptation
H. Zhao, R. Combes, K. Zhang and G. Gordon
In Proceedings of the 36th International Conference on Machine Learning
(ICML 2019, Long Oral)
[abs]
[pdf] [supplement] [poster] [slides] [blog]
Due to the ability of deep neural nets to learn rich representations, recent
advances in unsupervised domain adaptation have focused on learning
domain-invariant features that achieve a small error on the source domain.
The
hope is that the learnt representation, together with the hypothesis learnt
from
the source domain, can generalize to the target domain. In this paper, we
first
construct a simple counterexample showing that, contrary to common belief,
the
above conditions are not sufficient to guarantee successful domain
adaptation.
In particular, the counterexample (Fig. 1) exhibits \emph{conditional
shift}:
the class-conditional distributions of input features change between source
and
target domains. To give a sufficient condition for domain adaptation, we
propose
a natural and interpretable generalization upper bound that explicitly takes
into account the aforementioned shift. Moreover, we shed new light on the
problem by proving an information-theoretic lower bound on the joint error
of
\emph{any} domain adaptation method that attempts to learn invariant
representations. Our result characterizes a fundamental tradeoff between
learning invariant representations and achieving small joint error on both
domains when the marginal label distributions differ from source to target.
Finally, we conduct experiments on real-world datasets that corroborate our
theoretical findings. We believe these insights are helpful in guiding the
future design of domain adaptation and representation learning algorithms.
Learning Neural Networks with Adaptive Regularization
H. Zhao*, Y. H. Tsai*, R. Salakhutdinov and G. Gordon
In Proceedings of the 33rd Advances in Neural Information Processing Systems
(NeurIPS 2019)
[abs]
[pdf] [poster] [slides] [code]
Feed-forward neural networks can be understood as a combination of an
intermediate representation and a linear hypothesis. While most previous
works
aim to diversify the representations, we explore the complementary direction
by
performing an adaptive and data-dependent regularization motivated by the
empirical Bayes method. Specifically, we propose to construct a
matrix-variate
normal prior (on weights) whose covariance matrix has a Kronecker product
structure. This structure is designed to capture the correlations in neurons
through backpropagation. Under the assumption of this Kronecker
factorization,
the prior encourages neurons to borrow statistical strength from one
another.
Hence, it leads to an adaptive and data-dependent regularization when
training
networks on small datasets. To optimize the model, we present an efficient
block
coordinate descent algorithm with analytical solutions. Empirically, we
demonstrate that the proposed method helps networks converge to local optima
with smaller stable ranks and spectral norms. These properties suggest
better
generalizations and we present empirical results to support this
expectation. We
also verify the effectiveness of the approach on multiclass classification
and
multitask regression problems with various network structures.
Efficient Multitask Feature and Relationship Learning
H. Zhao, O. Stretcu, A. Smola and G. Gordon
In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence
(UAI 2019)
Also In Learning with Limited Labeled Data: Weak Supervision and Beyond
workshop at
NIPS 2017
[abs]
[pdf]
[supplement]
[poster]
We consider a multitask learning problem, in which several predictors are
learned jointly. Prior research has shown that learning the relations
between
tasks, and between the input features, together with the predictor, can lead
to
better generalization and interpretability, which proved to be useful for
applications in many domains. In this paper, we consider a formulation of
multitask learning that learns the relationships both between tasks and
between
features, represented through a task covariance and a feature covariance
matrix,
respectively. First, we demonstrate that existing methods proposed for this
problem present an issue that may lead to ill-posed optimization. We then
propose an alternative formulation, as well as an efficient algorithm to
optimize it. Using ideas from optimization and graph theory, we propose an
efficient coordinate-wise minimization algorithm that has a closed form
solution
for each block subproblem. Our experiments show that the proposed
optimization
method is orders of magnitude faster than its competitors. We also provide a
nonlinear extension that is able to achieve better generalization than
existing
methods.
On Strategyproof Conference Peer Review
Y. Xu*, H. Zhao*, X. Shi and N. B. Shah
In Proceedings of the 28th International Joint Conference on Artificial
Intelligence (IJCAI 2019)
[abs] [pdf] [supplement]
[Full arXiv version]
We consider peer review under a conference setting where there are conflicts
between the reviewers and the submissions. Under such conflicts, reviewers
can
manipulate their reviews in a strategic manner to influence the final
rankings
of their own papers. Present-day peer-review systems are not designed to
guard
against such strategic behavior, beyond minimal (and insufficient) checks
such
as not assigning a paper to a conflicted reviewer. In this work, we address
this
problem through the lens of social choice, and present a theoretical
framework
for strategyproof and efficient peer review. Given the conflict graph which
satisfies a simple property, we first present and analyze a flexible
framework
for reviewer-assignment and aggregation for the reviews that guarantees not
only
strategyproofness but also a natural efficiency property (unanimity). Our
framework is based on the so-called partitioning method, and can be treated
as a
generalization of this type of method to conference peer review settings. We
then empirically show that the requisite property on the (authorship)
conflict
graph is indeed satisfied in the ICLR-17 submissions data, and further
demonstrate a simple trick to make the partitioning method more practically
appealing under conference peer-review settings. Finally, we complement our
positive results with negative theoretical results where we prove that under
slightly stronger requirements, it is impossible for any algorithm to be
both
strategyproof and efficient.
Active Learning of Strict Partial Orders: A Case Study on Concept
Prerequisite
Relations
C. Liang, J. Ye, H. Zhao, B. Pursel and C. Lee Giles
In Proceedings of the 12th International Conference on Educational Data Mining
(EDM 2019)
[abs] [pdf]
Strict partial order is a mathematical structure commonly seen in relational
data. One obstacle to extracting such type of relations at scale is the lack
of
large scale labels for building effective data-driven solutions. We develop
an
active learning framework for mining such relations subject to a strict
order.
Our approach incorporates relational reasoning not only in finding new
unlabeled
pairs whose labels can be deduced from an existing label set, but also in
devising new query strategies that consider the relational structure of
labels.
Our experiments on concept prerequisite relations show our proposed
framework
can substantially improve the classification performance with the same query
budget compared to other baseline approaches.
Deep Generative and Discriminative Domain Adaptation
H. Zhao, J. Hu, Z. Zhu, A. Coates and G. Gordon
In Proceedings of the 18th International Conference on Autonomous Agents and
Multiagent Systems (AAMAS 2019)
[abs]
[pdf] [poster]
The ability to adapt to and learn from different domains and environments is
crucial for agents to generalize. In this paper we propose a probabilistic
framework for domain adaptation that blends both generative and
discriminative
modeling in a principled way. Under this framework, generative and
discriminative models correspond to specific choices of the prior over
parameters. By maximizing both the marginal and the conditional
log-likelihoods,
our models can use both labeled instances from the source domain as well as
unlabeled instances from \emph{both} source and target domains. We show that
the
popular reconstruction loss of autoencoder corresponds to an upper bound of
the
negative marginal log-likelihoods of unlabeled instances, and give a
generalization bound that explicitly incorporates it into the analysis. We
instantiate our framework using neural networks, and build a concrete model,
DAuto.
Adversarial Multiple Source Domain Adaptation
H. Zhao*, S. Zhang*, G. Wu, J. Costeira, J. Moura and G. Gordon
In Proceedings of the 32nd Advances in Neural Information Processing Systems
(NeurIPS 2018)
[abs] [pdf] [supplement] [poster] [code]
While domain adaptation has been actively researched, most algorithms focus
on
the single-source-single-target adaptation setting. In this paper we propose
new
generalization bounds and algorithms under both classification and
regression
settings for unsupervised multiple source domain adaptation. Our theoretical
analysis naturally leads to an efficient learning strategy using adversarial
neural networks: we show how to interpret it as learning feature
representations
that are invariant to the multiple domain shifts while still being
discriminative for the learning task. To this end, we propose multisource
domain
adversarial networks (MDAN) that approach domain adaptation by optimizing
task-adaptive generalization bounds. To demonstrate the effectiveness of
MDAN,
we conduct extensive experiments showing superior adaptation performance on
both
classification and regression problems: sentiment analysis, digit
classification, and vehicle counting.
Frank-Wolfe Optimization for Symmetric-NMF under Simplicial Constraint
H. Zhao and G. Gordon
In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence
(UAI 2018)
[abs] [pdf] [supplement]
Symmetric nonnegative matrix factorization has found abundant applications
in
various domains by providing a symmetric low-rank decomposition of
nonnegative
matrices. In this paper we propose a Frank-Wolfe (FW) solver to optimize the
symmetric nonnegative matrix factorization problem under a simplicial
constraint, which has recently been proposed for probabilistic clustering.
Compared with existing solutions, this algorithm is simple to implement, and
has
no hyperparameters to be tuned. Building on the recent advances of FW
algorithms
in nonconvex optimization, we prove an $O(1/\eps^2)$ convergence rate to
$\eps$-approximate KKT points, via a tight bound $\Theta(n^2)$ on the
curvature
constant, which matches the best known result in unconstrained nonconvex
setting
using gradient methods. Numerical results demonstrate the effectiveness of
our
algorithm. As a side contribution, we construct a simple nonsmooth convex
problem where the FW algorithm fails to converge to the optimum. This result
raises an interesting question about necessary conditions of the success of
the
FW algorithm on convex problems.
Convolutional-Recurrent Neural Networks for Speech Enhancement
H. Zhao, S. Zarar, I. Tashev and C. H. Lee
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP
2018, Oral)
[abs]
[pdf] [slides]
We propose an end-to-end model based on convolutional and recurrent neural
networks for speech enhancement. Our model is purely data-driven and does
not
make any assumptions about the type or the stationarity of the noise. In
contrast to existing methods that use multilayer perceptrons (MLPs), we
employ
both convolutional and recurrent neural network architectures. Thus, our
approach allows us to exploit local structures in both the frequency and
temporal domains. By incorporating prior knowledge of speech signals into
the
design of model structures, we build a model that is more data-efficient and
achieves better generalization on both seen and unseen noise. Based on
experiments with synthetic data, we demonstrate that our model outperforms
existing methods, improving PESQ by up to 0.6 on seen noise and 0.64 on
unseen
noise.
Approximate Empirical Bayes for Deep Neural Networks
H. Zhao*, Y. H. Tsai*, R. Salakhutdinov and G. Gordon
In Uncertainty in Deep Learning workshop at UAI (UAI UDL 2018)
[abs]
[pdf] [poster]
We propose an approximate empirical Bayes framework and an efficient algorithm for
learning the weight matrix of deep neural networks. Empirically, we show the proposed
method works as a regularization approach that helps generalization when training neural
networks on small datasets.
Multiple Source Domain Adaptation with Adversarial Learning
H. Zhao*, S. Zhang*, G. Wu, J. Costeira, J. Moura and G. Gordon
In 6th International Conference on Learning Representations (ICLR 2018 workshop
track)
[abs]
[pdf] [poster]
While domain adaptation has been actively researched in recent years, most theoretical
results and algorithms focus on the single-source-single-target adaptation setting.
Naive application of such algorithms on multiple source domain adaptation problem may
lead to suboptimal solutions. We propose a new generalization bound for domain
adaptation when there are multiple source domains with labeled instances and one target
domain with unlabeled instances. Compared with existing bounds, the new bound does not
require expert knowledge about the target distribution, nor the optimal combination rule
for multisource domains. Interestingly, our theory also leads to an efficient learning
strategy using adversarial neural networks: we show how to interpret it as learning
feature representations that are invariant to the multiple domain shifts while still
being discriminative for the learning task. To this end, we propose two models, both of
which we call multisource domain adversarial networks (MDANs): the first model optimizes
directly our bound, while the second model is a smoothed approximation of the first one,
leading to a more data-efficient and task-adaptive model. The optimization tasks of both
models are minimax saddle point problems that can be optimized by adversarial training.
To demonstrate the effectiveness of MDANs, we conduct extensive experiments showing
superior adaptation performance on three real-world datasets: sentiment analysis, digit
classification, and vehicle counting.
Linear Time Computation of Moments in Sum-Product Networks
H. Zhao and G. Gordon
In Proceedings of the 31st Advances in Neural Information Processing Systems
(NIPS 2017)
[abs] [pdf] [poster]
Bayesian online algorithms for Sum-Product Networks (SPNs) need to update
their
posterior distribution after seeing one single additional instance. To do
so,
they must compute moments of the model parameters under this distribution.
The
best existing method for computing such moments scales quadratically in the
size
of the SPN, although it scales linearly for trees. This unfortunate scaling
makes Bayesian online algorithms prohibitively expensive, except for small
or
tree-structured SPNs. We propose a linear-time algorithm that works even
when
the SPN is a general directed acyclic graph (DAG). Our algorithm
significantly
broadens the applicability of Bayesian online algorithms for SPNs. There are
three key ingredients in the design and analysis of our algorithm: 1). For
each
edge in the graph, we find a linear time reduction from the moment
computation
problem to a joint inference problem in SPNs. 2). Using the property that
each
SPN computes a multilinear polynomial, we construct an efficient procedure
for
polynomial evaluation by differentiation without expanding the network that
may
contain exponentially many positive monomials. 3). We propose a dynamic
programming method to further reduce the computation of the moments of all
the
edges in the graph from quadratic to linear. We demonstrate the usefulness
of
our linear time moment computation algorithm by applying it to develop a
linear
time assume density filter (ADF) for SPNs.
Unsupervised Domain Adaptation with a Relaxed Covariate Shift Assumption
T. Adel, H. Zhao and A. Wong
In Proceedings of the 31th AAAI Conference on Artificial Intelligence (AAAI
2017)
[abs] [pdf]
Domain adaptation addresses learning tasks where training is performed on
data
from one domain whereas testing is performed on data belonging to a
different
but related domain. Assumptions about the relationship between the source
and
target domains should lead to tractable solutions on the one hand, and be
realistic on the other hand. Here we propose a generative domain adaptation
model that allows for modeling different assumptions about this
relationship,
among which is a newly introduced assumption that replaces covariate shift
with
a possibly more realistic assumption without losing tractability due to the
efficient variational inference procedure developed. In addition to the
ability
to model less restrictive relationships between source and target, modeling
can
be performed without any target labeled data (unsupervised domain
adaptation).
We also provide a Rademacher complexity bound of the proposed algorithm. We
evaluate the model on the Amazon reviews and the CVC pedestrian detection
datasets.
Discovering Order in Unordered Datasets: Generative Markov Networks
Y. H. Tsai, H. Zhao, R. Salakhutdinov and N. Jojic
In Time Series workshop at NIPS (NIPS TSW 2017)
[abs]
[pdf] [slides] [poster]
The assumption that data samples are independently identically distributed is the
backbone of many learning algorithms. Nevertheless, datasets often exhibit rich
structures in practice, and we argue that there exist some unknown orders within the
data instances. Aiming to find such orders, we introduce a novel Generative Markov
Network (GMN) which we use to extract the order of data instances automatically.
Specifically, we assume that the instances are sampled from a Markov chain. Our goal is
to learn the transitional operator of the chain as well as the generation order by
maximizing the generation probability under all possible data permutations. One of our
key ideas is to use neural networks as a soft lookup table for approximating the
possibly huge, but discrete transition matrix. This strategy allows us to amortize the
space complexity with a single model and make the transitional operator generalizable to
unseen instances. To ensure the learned Markov chain is ergodic, we propose a greedy
batch-wise permutation scheme that allows fast training. Empirically, we evaluate the
learned Markov chain by showing that GMNs are able to discover orders among data
instances and also perform comparably well to state-of-the-art methods on the one-shot
recognition benchmark task.
A Unified Approach for Learning the Parameters of Sum-Product Networks
H. Zhao, P. Poupart and G. Gordon
In Proceedings of the 30th Advances in Neural Information Processing Systems
(NIPS 2016)
[abs] [pdf] [supplement] [poster] [code]
We present a unified approach for learning the parameters of Sum-Product
networks (SPNs). We prove that any complete and decomposable SPN is
equivalent
to a mixture of trees where each tree corresponds to a product of univariate
distributions. Based on the mixture model perspective, we characterize the
objective function when learning SPNs based on the maximum likelihood
estimation
(MLE) principle and show that the optimization problem can be formulated as
a
signomial program. Both the projected gradient descent (PGD) and the
exponentiated gradient (EG) in this setting can be viewed as first order
approximations of the signomial program after proper transformation of the
objective function. Based on the signomial program formulation, we construct
two
parameter learning algorithms for SPNs by using sequential monomial
approximations (SMA) and the concave-convex procedure (CCCP), respectively.
The
two proposed methods naturally admit multiplicative updates, hence
effectively
avoiding the projection operation. With the help of the unified framework,
we
also show that, in the case of SPNs, CCCP leads to the same algorithm as
Expectation Maximization (EM) despite the fact that they are different in
general. Extensive experiments on 20 data sets demonstrate the effectiveness
and
efficiency of the two proposed approaches for learning SPNs. We also show
that
the proposed methods can improve the performance of structure learning and
yield
state-of-the-art results.
Collapsed Variational Inference for Sum-Product Networks
H. Zhao, T. Adel, G. Gordon and B. Amos
In Proceedings of the 33rd International Conference on Machine Learning
(ICML
2016)
[abs] [pdf] [poster] [slides] [code]
Sum-Product Networks (SPNs) are probabilistic inference machines that admit
exact inference in linear time in the size of the network. Existing
parameter
learning approaches for SPNs are largely based on the maximum likelihood
principle and hence are subject to overfitting compared to more Bayesian
approaches. Exact Bayesian posterior inference for SPNs is computationally
intractable. Both standard variational inference and posterior sampling for
SPNs
are computationally infeasible even for networks of moderate size due to the
large number of local latent variables per instance. In this work, we
propose a
novel deterministic collapsed variational inference algorithm for SPNs that
is
computationally efficient, easy to implement and at the same time allows us
to
incorporate prior information into the optimization formulation. Extensive
experiments show a significant improvement in accuracy compared with a
maximum
likelihood based approach.
Online Algorithms for Sum-Product Networks with Continuous Variables
P. Jaini, A. Rashwan, H. Zhao, Y. Liu, E. Banijamali, Z. Chen and P.
Poupart
In Proceedings of the 8th International Conference on Probabilistic Graphical
Models (PGM 2016)
[abs] [pdf]
Sum-product networks (SPNs) have recently emerged as an attractive
representation due to their dual interpretation as a special type of deep
neural
network with clear semantics and a tractable probabilistic graphical model.
We
explore online algorithms for parameter learning in SPNs with continuous
variables. More specifically, we consider SPNs with Gaussian leaf
distributions
and show how to derive an online Bayesian moment matching algorithm to learn
from streaming data. We compare the resulting generative models to stacked
restricted Boltzmann machines and generative moment matching networks on
real-world datasets.
Online and Distributed Bayesian Moment Matching for Parameter Learning in
Sum-Product Networks
A. Rashwan, H. Zhao and P. Poupart
In Proceedings of the 19th International Conference on Artificial Intelligence
and
Statistics (AISTATS 2016)
[abs]
[pdf]
Probabilistic graphical models provide a general and flexible framework for
reasoning about complex dependencies in noisy domains with many variables.
Among
the various types of probabilistic graphical models, sum-product networks
(SPNs)
have recently generated some interest because exact inference can always be
done
in linear time with respect to the size of the network. This is particularly
attractive since it means that learning an SPN from data always yields a
tractable model for inference. However, existing parameter learning
algorithms
for SPNs operate in batch mode and do not scale easily to large datasets. In
this work, we explore online algorithms to ensure that parameter learning
can
also be done tractably with respect to the amount of data. More
specifically, we
propose a new Bayesian moment matching (BMM) algorithm that operates
naturally
in an online fashion and that can be easily distributed. We demonstrate the
effectiveness and scalability of BMM in comparison to online extensions of
gradient descent and expectation maximization on 20 classic benchmarks and 4
large scale datasets.
On the Relationship between Sum-Product Networks and Bayesian
Networks
H. Zhao, M. Melibari and P. Poupart
In Proceedings of the 32nd International Conference on Machine Learning
(ICML
2015)
[abs] [pdf] [supplement] [Full arXiv version] [slides] [poster]
In this paper, we establish some theoretical connections between Sum-Product
Networks (SPNs) and Bayesian Networks (BNs). We prove that every SPN can be
converted into a BN in linear time and space in terms of the network size.
The
key insight is to use Algebraic Decision Diagrams (ADDs) to compactly
represent
the local conditional probability distributions at each node in the
resulting BN
by exploiting context-specific independence (CSI). The generated BN has a
simple
directed bipartite graphical structure. We show that by applying the
Variable
Elimination algorithm (VE) to the generated BN with ADD representations, we
can
recover the original SPN where the SPN can be viewed as a history record or
caching of the VE inference process. To help state the proof clearly, we
introduce the notion of {\em normal} SPN and present a theoretical analysis
of
the consistency and decomposability properties. We conclude the paper with
some
discussion of the implications of the proof and establish a connection
between
the depth of an SPN and a lower bound of the tree-width of its corresponding
BN.
Self-Adaptive Hierarchical Sentence Model
H. Zhao, Z. Lu and P. Poupart
In Proceedings of the 24th International Joint Conference on Artificial
Intelligence (IJCAI 2015)
[abs]
[pdf]
[slides]
[poster]
[code]
The ability to accurately model a sentence at varying stages (e.g.,
word-phrase-sentence) plays a central role in natural language processing.
As an
effort towards this goal we propose a self-adaptive hierarchical sentence
model
(AdaSent). AdaSent effectively forms a hierarchy of representations from
words
to phrases and then to sentences through recursive gated local composition
of
adjacent segments. We design a competitive mechanism (through gating
networks)
to allow the representations of the same sentence to be engaged in a
particular
learning task (e.g., classification), therefore effectively mitigating the
gradient vanishing problem persistent in other recursive models. Both
qualitative and quantitative analysis shows that AdaSent can automatically
form
and select the representations suitable for the task at hand during
training,
yielding superior classification performance over competitor models on 5
benchmark data sets.
SoF: Soft-Cluster Matrix Factorization for Probabilistic Clustering
H. Zhao, P. Pouart, Y. Zhang and M. Lysy
In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI
2015)
[abs] [pdf]
[poster]
We propose SoF (Soft-cluster matrix Factorization), a probabilistic
clustering
algorithm which softly assigns each data point into clusters. Unlike
model-based
clustering algorithms, SoF does not make assumptions about the data density
distribution. Instead, we take an axiomatic approach to define 4 properties
that
the probability of co-clustered pairs of points should satisfy. Based on the
properties, SoF utilizes a distance measure between pairs of points to
induce
the conditional co-cluster probabilities. The objective function in our
framework establishes an important connection between probabilistic
clustering
and constrained symmetric Nonnegative Matrix Factorization (NMF), hence
providing a theoretical interpretation for NMF-based clustering algorithms.
To
optimize the objective, we derive a sequential minimization algorithm using
a
penalty method. Experimental results on both synthetic and real-world
datasets show that SoF significantly outperforms previous NMF-based algorithms and
that
it is able to detect non-convex patterns as well as cluster boundaries.
Global Network Alignment in the Context Of Aging
F. Faisal, H. Zhao and T. Milenkovic
IEEE/ACM Transactions on Computational Biology and Bioinformatics (IEEE/ACM
TCBB
2014)
In Proceedings of the 4th ACM International Conference on Bioinformatics,
Computational Biology and Biomedicine (ACM BCB 2013)
[abs] [pdf]
[supplement]
Analogous to sequence alignment, network alignment (NA) can be used to
transfer
biological knowledge across species between conserved network regions. NA
faces
two algorithmic challenges: 1) Which cost function to use to capture
“similarities” between nodes in different networks? 2) Which alignment
strategy
to use to rapidly identify “high-scoring” alignments from all possible
alignments? We “break down” existing state-of-the-art methods that use both
different cost functions and different alignment strategies to evaluate each
combination of their cost functions and alignment strategies. We find that a
combination of the cost function of one method and the alignment strategy of
another method beats the existing methods. Hence, we propose this
combination as
a novel superior NA method. Then, since human aging is hard to study
experimentally due to long lifespan, we use NA to transfer aging-related
knowledge from well annotated model species to poorly annotated human. By
doing
so, we produce novel human aging-related knowledge, which complements
currently
available knowledge about aging that has been obtained mainly by sequence
alignment. We demonstrate significant similarity between topological and
functional properties of our novel predictions and those of known
aging-related
genes. We are the first to use NA to learn more about aging.
A Sober Look at Spectral Learning
H. Zhao and P. Poupart
In Method of Moments and Spectral Learning workshop at ICML (ICML MM 2014)
[abs]
[pdf]
[slides]
[poster]
[code]
Spectral learning recently generated lots of excitement in machine learning, largely
because it is the first known method to produce consistent estimates (under suitable
conditions) for several latent variable models. In contrast, maximum likelihood
estimates may get trapped in local optima due to the non-convex nature of the likelihood
function of latent variable models. In this paper, we do an empirical evaluation of
spectral learning (SL) and expectation maximization (EM), which reveals an important gap
between the theory and the practice. First, SL often leads to negative probabilities.
Second, EM often yields better estimates than spectral learning and it does not seem to
get stuck in local optima. We discuss how the rank of the model parameters and the
amount of training data can yield negative probabilities. We also question the common
belief that maximum likelihood estimators are necessarily inconsistent.
Y. Hu, F. Wu, H. Ye, D. Forsyth, J. Zou, N. Jiang, J. W. Ma, H. Zhao
In Proceedings of the 39th Advances in Neural Information Processing Systems (NeurIPS 2025, Oral)
[abs] [pdf]
Online reinforcement learning (RL) excels in complex, safety-critical domains, yet it faces challenges such as sample inefficiency, training instability, and a lack of interpretability. Data attribution offers a principled way to trace model behavior back to individual training samples. However, in online RL, each training sample not only drives policy updates but also influences future data collection, violating the fixed dataset assumption in existing attribution methods. In this paper, we initiate the study of data attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO) algorithm. We start by establishing a local attribution framework, interpreting model checkpoints with respect to the records in the recent training buffer. We design two target functions, capturing agent action and cumulative return respectively, and measure each record's contribution through gradient similarity between its training loss and these targets. We demonstrate the power of this framework through three concrete applications: diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. Leveraging this framework, we further propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates. Across standard RL benchmarks (classic control, navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity, speeds up training, and achieves higher returns. Overall, these results advance interpretability, efficiency, and effectiveness of online RL.
W. Chen, H. Zhao
In Proceedings of the 39th Advances in Neural Information Processing Systems (NeurIPS 2025)
[abs] [pdf] [code]
Neural Probabilistic Circuits (NPCs), a new class of concept bottleneck models, comprise an attribute recognition model and a probabilistic circuit for reasoning. By integrating the outputs from these two modules, NPCs produce compositional and interpretable predictions. While offering enhanced interpretability and high performance on downstream tasks, the neural-network-based attribute recognition model remains a black box. This vulnerability allows adversarial attacks to manipulate attribute predictions by introducing carefully crafted subtle perturbations to input images, potentially compromising the final predictions. In this paper, we theoretically analyze the adversarial robustness of NPC and demonstrate that it only depends on the robustness of the attribute recognition model and is independent of the robustness of the probabilistic circuit. Moreover, we propose RNPC, the first robust neural probabilistic circuit against adversarial attacks on the recognition module. RNPC introduces a novel class-wise integration for inference, ensuring a robust combination of outputs from the two modules. Our theoretical analysis demonstrates that RNPC exhibits provably improved adversarial robustness compared to NPC. Empirical results on image classification tasks show that RNPC achieves superior adversarial robustness compared to existing concept bottleneck models while maintaining high accuracy on benign inputs.
W. Chen, S. Yu, H. Shao, L. Sha, H. Zhao
arXiv preprint
[abs] [pdf] [code]
End-to-end deep neural networks have achieved remarkable success across various domains but are often criticized for their lack of interpretability. While post hoc explanation methods attempt to address this issue, they often fail to accurately represent these black-box models, resulting in misleading or incomplete explanations. To overcome these challenges, we propose an inherently transparent model architecture called Neural Probabilistic Circuits (NPCs), which enable compositional and interpretable predictions through logical reasoning. In particular, an NPC consists of two modules: an attribute recognition model, which predicts probabilities for various attributes, and a task predictor built on a probabilistic circuit, which enables logical reasoning over recognized attributes to make class predictions. To train NPCs, we introduce a three-stage training algorithm comprising attribute recognition, circuit construction, and joint optimization. Moreover, we theoretically demonstrate that an NPC's error is upper-bounded by a linear combination of the errors from its modules. To further demonstrate the interpretability of NPC, we provide both the most probable explanations and the counterfactual explanations. Empirical results on four benchmark datasets show that NPCs strike a balance between interpretability and performance, achieving results competitive even with those of end-to-end black-box models while providing enhanced interpretability.
S. Zhou, T. Yu, Z. Zhang, H. Chang, X. Zhou, D. Wu, H. Zhao
In Proceedings of the 39th Advances in Neural Information Processing Systems (NeurIPS 2025)
[abs] [pdf]
Machine unlearning (MU) aims to efficiently remove sensitive or harmful memory from a pre-trained model. The key challenge is to balance the potential tradeoff between unlearning efficacy and utility preservation, which involves forgetting undesirable information as defined while maintaining the model's original performance. One potential way to tackle this problem is to use multi-objective optimization to jointly optimize both the unlearning and utility preservation objectives. However, existing multi-objective methods only guarantee finding a Pareto-optimal solution without fine-grained control, which causes under-optimization of the unlearning objective. To this end, we first model MU as a constrained optimization problem, that is, optimizing the unlearning objective under the constraint of a bounded increase for utility loss. We then show that solving this optimization problem is equivalent to unilateral gradient surgery on the unlearning objective. To resolve the additional computational cost brought by gradient surgery, we propose an implicit gradient surgery method, which approximates the solution to the aforementioned constrained optimization problem via only one backpropagation, thereby achieving efficient utility-preserving MU. Theoretically, we provide a tight convergence analysis of the algorithm. Empirically, our extensive experiments show that the proposed algorithm achieves better tradeoff results than existing baselines. Theoretically, we provide a tight convergence analysis of the algorithm. Empirically, our extensive experiments show that the proposed algorithm achieves better tradeoff results than existing baselines.
W. Wang, J. Deng, Y. Hu, S. Zhang, X. Jiang, R. Zhang, H. Zhao, J. W. Ma
In Proceedings of the 39th Advances in Neural Information Processing Systems (NeurIPS 2025)
[abs] [pdf]
Data attribution methods, which quantify the influence of individual training data points on a machine learning model, have gained increasing popularity in data-centric applications in modern AI. Despite a recent surge of new methods developed in this space, the impact of hyperparameter tuning in these methods remains under-explored. In this work, we present the first large-scale empirical study to understand the hyperparameter sensitivity of common data attribution methods. Our results show that most methods are indeed sensitive to certain key hyperparameters. However, unlike typical machine learning algorithms -- whose hyperparameters can be tuned using computationally-cheap validation metrics -- evaluating data attribution performance often requires retraining models on subsets of training data, making such metrics prohibitively costly for hyperparameter tuning. This poses a critical open challenge for the practical application of data attribution methods. To address this challenge, we advocate for better theoretical understandings of hyperparameter behavior to inform efficient tuning strategies. As a case study, we provide a theoretical analysis of the regularization term that is critical in many variants of influence function methods. Building on this analysis, we propose a lightweight procedure for selecting the regularization value without model retraining, and validate its effectiveness across a range of standard data attribution benchmarks. Overall, our study identifies a fundamental yet overlooked challenge in the practical application of data attribution, and highlights the importance of careful discussion on hyperparameter selection in future method development.
P. Hu, J. Melkonian, W. Tang, H. Zhao, J. W. Ma
In Proceedings of the 39th Advances in Neural Information Processing Systems (NeurIPS 2025)
[abs] [pdf] [code]
Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm and its variants FactGraSS for linear layers specifically, that explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. Our code is publicly available at https://github.com/TRAIS-Lab/GraSS.
Y. He, S. Zeng, Y. Hu, R. Yang, T. Zhang, and H. Zhao
In Proceedings of the 39th Advances in Neural Information Processing Systems, Track on Datasets and Benchmarks (NeurIPS 2025, D&B Track)
[abs] [pdf] [project page] [code] [Hugging Face models]
Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. Our project page is at https://github.com/uiuctml/MergeBench.
J. Shen, J. Yao, R. Yang, Y. Sun, F. Luo, R. Pan, T. Zhang, H. Zhao
In Proceedings of the Association for Computational Linguistics: EMNLP (EMNLP 2025, Outstanding Paper Award)
[abs] [pdf] [code] [poster]
Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a global reward function, failing to capture the inherently diverse and heterogeneous human preferences. Hence, such oversimplification limits LLMs from supporting personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution of diverse subgroups, a single BT model has an irreducible error. While existing solutions, such as multi-objective learning with fine-grained annotations, help address this issue, they are costly and constrained by predefined attributes, failing to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts mixture weights based on specific context to resolve ambiguity, allowing for efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization.
Y. Chen, H. Si, G. Zhang, H. Zhao
In Proceedings of the 41st conference on Uncertainty in Artificial Intelligence (UAI 2025)
[abs] [pdf]
Domain generalization (DG) seeks to develop models that generalize well to unseen target domains, addressing the prevalent issue of distribution shifts in real-world applications. One line of research in DG focuses on aligning domain-level gradients and Hessians to enhance generalization. However, existing methods are computationally inefficient and the underlying principles of these approaches are not well understood. In this paper, we develop the theory of moment alignment for DG. Grounded in \textit{transfer measure}, a principled framework for quantifying generalizability between two domains, we first extend the definition of transfer measure to domain generalization that includes multiple source domains and establish a target error bound. Then, we prove that aligning derivatives across domains improves transfer measure both when the feature extractor induces an invariant optimal predictor across domains and when it does not. Notably, moment alignment provides a unifying understanding of Invariant Risk Minimization, gradient matching, and Hessian matching, three previously disconnected approaches to DG. We further connect feature moments and derivatives of the classifier head, and establish the duality between feature learning and classifier fitting. Building upon our theory, we introduce \textbf{C}losed-Form \textbf{M}oment \textbf{A}lignment (CMA), a novel DG algorithm that aligns domain-level gradients and Hessians in closed-form. Our method overcomes the computational inefficiencies of existing gradient and Hessian-based techniques by eliminating the need for repeated backpropagation or sampling-based Hessian estimation. We validate the efficacy of our approach through two sets of experiments: linear probing and full fine-tuning. CMA demonstrates superior performance in both settings compared to Empirical Risk Minimization and state-of-the-art algorithms.
X. Zhang, P. Li, Y. Yu, Y. Zhang, H. Zhao, Q. Zhang
In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)
[abs] [pdf]
Distribution matching is a core concept in machine learning, with applications in generative models, domain adaptation, and algorithmic fairness. A closely related but less explored challenge is generating a distribution that aligns with multiple underlying distributions, often with conflicting objectives, known as a Pareto optimal distribution. In this paper, we develop a general theory based on information geometry to construct the Pareto set and front for the entire exponential family under KL and inverse KL divergences. This formulation allows explicit derivation of the Pareto set and front for multivariate normal distributions, enabling applications like multiobjective variational autoencoders (MOVAEs) to generate interpolated image distributions. Experimental results on real-world images demonstrate that both algorithms can generate high-quality interpolated images across multiple distributions.
Y. He, A. Benhaim, B. Patra, P. Vaddamanu, S. Ahuja, P. Chopra, V. Chaudhary, H. Zhao, X. Song
In Proceedings of the 63th Annual Meeting of the Association for Computational Linguistics (ACL 2025 Findings)
[abs] [pdf]
We propose a novel scaling law for general-purpose decoder-only language models (LMs) trained on multilingual data, tackling the problem of balancing languages during multilingual pretraining. A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. To address this, we shift the focus from individual languages to language families. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio, independent of other languages in the mixture. This insight simplifies the complexity of multilingual scaling and make the analysis scalable to an arbitrary number of languages. Building on this hypothesis, we derive a power-law relationship that links performance with dataset size, model size and sampling ratios. This relationship enables us to predict performance across various combinations of the above three quantities, and derive the optimal sampling ratios at different model scales. To demonstrate the effectiveness and accuracy of our proposed scaling law, we perform a large-scale empirical study, training more than 100 models on 23 languages spanning 5 language families. Our experiments show that the optimal sampling ratios derived from small models (85M parameters) generalize effectively to models that are several orders of magnitude larger (1.2B parameters), offering a resource-efficient approach for multilingual LM training at scale.
J. Zhang, R. Pan, Y. Hu, K. Shum, G. Yao, X. Liu, R. Pi, H. Dong, S. Diao, Y. Lin, H. Zhao, T. Zhang
In ACM Transactions on Intelligent Systems and Technology (TIST 2025).
[abs] [pdf]
The AI community has witnessed the emergence of various chat-style Large Language Models (LLMs) since the advent of ChatGPT. Despite significant progress in this area, evaluating these models remains a substantial challenge. The evaluations provided by humans or GPT-4 oracles are often taken as the gold standard, but they are neither automatic nor scalable. More recently, a series of (open-source) LLM-based judge models have been introduced, yet they often exhibit model-specific biases, e.g., a LLaMA-family judge favors a LLaMAfamily model. On the other hand, autoregressive evaluation metrics, which holds the potential to address the aforementioned issues, remains underexplored. Among them, likelihood-based metrics such as perplexity and negative log-likelihood (NLL) are widely adopted and has proven effective in tracking the pretraining progress of LLMs. However, they struggle to evaluate the generation capabilities of fine-tuned models due to exposure bias, a phenomenon where the distribution of the model's output gradually deviates from the ground-truth during inference. To address this key issue, in this paper, we propose a novel autoregressive metric, Normalized Discounted Cumulative Gain (NDCG), to improve the evaluation of fine-tuned LLMs. Our experimental results demonstrate that NDCG significantly outperforms likelihood-based metrics: it shows over 45% improvement in both Spearman and Kendall's tau correlation coefficients for commonsense QA tasks, and aligns more closely with GPT-4 Elo rankings for instruction-tuned models.
S. Zeng, S. Du, M. Yamada, H. Zhao
In Proceedings of the 13th International Conference on Learning Representations (ICLR 2025)
[abs] [pdf] [code]
To embed structured knowledge within labels into feature representations, prior work (Zeng et al., 2022) proposed to use the Cophenetic Correlation Coefficient (CPCC) as a regularizer during supervised learning. This regularizer calculates pairwise Euclidean distances of class means and aligns them with the corresponding shortest path distances derived from the label hierarchy tree. However, class means may not be good representatives of the class conditional distributions, especially when they are multi-mode in nature. To address this limitation, under the CPCC framework, we propose to use the Earth Mover's Distance (EMD) to measure the pairwise distances among classes in the feature space. We show that our exact EMD method generalizes previous work, and recovers the existing algorithm when class-conditional distributions are Gaussian in the feature space. To further improve the computational efficiency of our method, we introduce the Optimal Transport-CPCC family by exploring four EMD approximation variants. Our most efficient OT-CPCC variant runs in linear time in the size of the dataset, while maintaining competitive performance across datasets and tasks. The code is available at https://github.com/uiuctml/OTCPCC.
H. Zhao, Y. Wang, H. Qi, Z. Huang, H. Zhao, L. Sha, H. Shao
In Proceedings of the 13th International Conference on Learning Representations (ICLR 2025)
[abs] [pdf] [code]
Neural Ordinary Differential Equations (Neural ODEs or NODEs) excel at modeling continuous dynamical systems from observational data, especially when the data is irregularly sampled. However, existing training methods predominantly rely on numerical ODE solvers, which are time-consuming and prone to accumulating numerical errors over time due to autoregression. In this work, we propose VF-NODE, a novel approach based on the variational formulation (VF) to accelerate the training of NODEs. Unlike existing training methods, the proposed VF-NODEs implement a series of global integrals, thus evaluating Deep Neural Network (DNN)--based vector fields only at specific observed data points. This strategy drastically reduces the number of function evaluations (NFEs). Moreover, our method eliminates the use of autoregression, thereby reducing error accumulations for modeling dynamical systems. Nevertheless, the VF loss introduces oscillatory terms into the integrals when using the Fourier basis. We incorporate Filon's method to address this issue. To further enhance the performance for noisy and incomplete data, we employ the natural cubic spline regression to estimate a closed-form approximation. We provide a fundamental analysis of how our approach minimizes computational costs. Extensive experiments demonstrate that our approach accelerates NODE training by 10 to 1000 times compared to existing NODE-based methods, while achieving higher or comparable accuracy in dynamical systems. The code is available at https://github.com/ZhaoHongjue/VF-NODE-ICLR2025.
S. Poppi, ZX. Yong, Y. He, B. Chern, H. Zhao, A. Yang, J. Chi
In Findings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL Findings 2025)
[abs] [pdf]
Recent advancements in Large Language Models (LLMs) have sparked widespread concerns about their safety. Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning with a few adversarially chosen instruction-following examples, i.e., fine-tuning attacks. We take a further step to understand fine-tuning attacks in multilingual LLMs. We first discover cross-lingual generalization of fine-tuning attacks: using a few adversarially chosen instruction-following examples in one language, multilingual LLMs can also be easily compromised (e.g., multilingual LLMs fail to refuse harmful prompts in other languages). Motivated by this finding, we hypothesize that safety-related information is language-agnostic and propose a new method termed Safety Information Localization (SIL) to identify the safety-related information in the model parameter space. Through SIL, we validate this hypothesis and find that only changing 20% of weight parameters in fine-tuning attacks can break safety alignment across all languages. Furthermore, we provide evidence to the alternative pathways hypothesis for why freezing safety-related parameters does not prevent fine-tuning attacks, and we demonstrate that our attack vector can still jailbreak LLMs adapted to new languages.
Y. He, Y. Hu, Y. Lin, T. Zhang, H. Zhao
Transactions on Machine Learning Research (TMLR 2025, J2C certificate)
[abs] [pdf] [code]
Model merging offers an effective strategy to combine the strengths of multiple finetuned models into a unified model that preserves the specialized capabilities of each. Existing methods merge models in a global manner, performing arithmetic operations across all model parameters. However, such global merging often leads to task interference, degrading the performance of the merged model. In this work, we introduce Localize-and-Stitch, a novel approach that merges models in a localized way. Our algorithm works in two steps: i) Localization: identify tiny (1% of the total parameters) localized regions in the finetuned models containing essential skills for the downstream tasks, and ii) Stitching: reintegrate only these essential regions back into the pretrained model for task synergy. We demonstrate that our approach effectively locates sparse regions responsible for finetuned performance, and the localized regions could be treated as compact and interpretable representations of the finetuned models (tasks). Empirically, we evaluate our method on various vision and language benchmarks, showing that it outperforms existing model merging methods under different data availability scenarios. Beyond strong empirical performance, our algorithm also facilitates model compression and preserves pretrained knowledge, enabling flexible and continual skill composition from multiple finetuned models with minimal storage and computational overhead.
Y. Hu, P. Hu, H. Zhao, J. Ma
In Proceedings of the 38th Advances in Neural Information Processing Systems (NeurIPS 2024)
[abs] [pdf] [poster]
How can we attribute the behaviors of machine learning models to their training data? While the classic influence function sheds light on the impact of individual samples, it often fails to capture the more complex and pronounced collective influence of a set of samples. To tackle this challenge, we study the Most Influential Subset Selection (MISS) problem, which aims to identify a subset of training samples with the greatest collective influence. We conduct a comprehensive analysis of the prevailing approaches in MISS, elucidating their strengths and weaknesses. Our findings reveal that influence-based greedy heuristics, a dominant class of algorithms in MISS, can provably fail even in linear regression. We delineate the failure modes, including the errors of influence function and the non-additive structure of the collective influence. Conversely, we demonstrate that an adaptive version of these heuristics which applies them iteratively, can effectively capture the interactions among samples and thus partially address the issues. Experiments on real-world datasets corroborate these theoretical findings, and further demonstrate that the merit of adaptivity can extend to more complex scenarios such as classification tasks and non-linear neural networks. We conclude our analysis by emphasizing the inherent trade-off between performance and computational efficiency, questioning the use of additive metrics such as the linear datamodeling score, and offering a range of discussions.
A. Sinha, S. Zeng, M. Yamada, H. Zhao
In Proceedings of the 38th Advances in Neural Information Processing Systems (NeurIPS 2024)
[abs] [pdf] [code] [poster]
Most real-world datasets consist of a natural hierarchy between classes or an inherent label structure that is either already available or can be constructed cheaply. However, most existing representation learning methods ignore this hierarchy, treating labels as permutation invariant. Recent work [Zeng et al., 2022] proposes using this structured information explicitly, but the use of Euclidean distance may distort the underlying semantic context [Chen et al., 2013]. In this work, motivated by the advantage of hyperbolic spaces in modeling hierarchical relationships, we propose a novel approach HypStructure: a Hyperbolic Structured regularization approach to accurately embed the label hierarchy into the learned representations. HypStructure is a simple-yet-effective regularizer that consists of a hyperbolic tree-based representation loss along with a centering loss, and can be combined with any standard task loss to learn hierarchy-informed features. Extensive experiments on several large-scale vision benchmarks demonstrate the efficacy of HypStructure in reducing distortion and boosting generalization performance especially under low dimensional scenarios. For a better understanding of structured representation, we perform eigenvalue analysis that links the representation geometry to improved Out-of-Distribution (OOD) detection performance seen empirically.
L. Yin, H. Zhao
In Proceedings of the 38th Advances in Neural Information Processing Systems (NeurIPS 2024)
[abs] [pdf] [poster]
Probabilistic circuits (PCs) have emerged as a powerful framework to compactly represent probability distributions for efficient and exact probabilistic inference. It has been shown that PCs with a general directed acyclic graph (DAG) structure can be understood as a mixture of exponentially (in its height) many components, each of which is a product distribution over univariate marginals. However, existing structure learning algorithms for PCs often generate tree-structured circuits or use tree-structured circuits as intermediate steps to compress them into DAG-structured circuits. This leads to the intriguing question of whether there exists an exponential gap between DAGs and trees for the PC structure. In this paper, we provide a negative answer to this conjecture by proving that, for n variables, there exists a quasi-polynomial upper bound $n^O(\log n)$ on the size of an equivalent tree computing the same probability distribution. On the other hand, we also show that given a depth restriction on the tree, there is a super-polynomial separation between tree and DAG-structured PCs. Our work takes an important step towards understanding the expressive power of tree-structured PCs, and our techniques may be of independent interest in the study of structure learning algorithms for PCs.
E. Ma, C. Pan, S. Rasoul Etesami, H. Zhao, O. Milenkovic
In Proceedings of the 38th Advances in Neural Information Processing Systems (NeurIPS 2024)
[abs] [pdf]
The performance of Transfer Learning (TL) heavily relies on effective pretraining, which demands large datasets and substantial computational resources. As a result, executing TL is often challenging for individual model developers. Federated Learning (FL) addresses these issues by facilitating collaborations among clients, expanding the dataset indirectly, distributing computational costs, and preserving privacy. However, key challenges remain unresolved. First, existing FL methods tend to optimize transferability only within local domains, neglecting the global learning domain. Second, most approaches rely on indirect transferability metrics, which do not accurately reflect the final target loss or true degree of transferability. To address these gaps, we propose two enhancements to FL. First, we introduce a client-server exchange protocol that leverages cross-client Jacobian (gradient) norms to boost transferability. Second, we increase the average Jacobian norm across clients at the server, using this as a local regularizer to reduce cross-client Jacobian variance. Our transferable federated algorithm, termed FedGTST (Federated Global Transferability via Statistics Tuning), demonstrates that increasing the average Jacobian and reducing its variance allows for tighter control of the target loss. This leads to an upper bound on the target loss in terms of the source loss and source-target domain discrepancy. Extensive experiments on datasets such as MNIST to MNIST-M and CIFAR10 to SVHN show that FedGTST outperforms relevant baselines, including FedSR. On the second dataset pair, FedGTST improves accuracy by 9.8% over FedSR and 7.6% over FedIIR when LeNet is used as the backbone.
X. Zhang, L. Zhao, Y. Yu, X. Lin, Y. Chen, H. Zhao, Q. Zhang
In Proceedings of the 38th Advances in Neural Information Processing Systems, Track on Datasets and Benchmarks (NeurIPS 2024, D&B Track)
[abs] [pdf] [code]
Multiobjective optimization problems (MOPs) are prevalent in machine learning, with applications in multi-task learning, learning under fairness or robustness constraints, etc. Instead of reducing multiple objective functions into a scalar objective, MOPs aim to optimize for the so-called Pareto optimality or Pareto set learning, which involves optimizing more than one objective function simultaneously, over models with thousands / millions of parameters. Existing benchmark libraries for MOPs mainly focus on evolutionary algorithms, most of which are zeroth-order / meta-heuristic methods that do not effectively utilize higher-order information from objectives and cannot scale to large-scale models with thousands / millions of parameters. In light of the above gap, this paper introduces LibMOON, the first multiobjective optimization library that supports state-of-the-art gradient-based methods, provides a fair benchmark, and is open-sourced for the community.
H. Wang, W. Xiong, T. Xie, H. Zhao, T. Zhang
In Proceedings of the Association for Computational Linguistics: EMNLP 2024 (EMNLP 2024 Findings)
[abs] [pdf] [code]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) using human preference data. Conventional RMs are trained on pairwise responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM serves as a proxy for human preferences. However, due to the black-box nature of RMs, their outputs lack interpretability, as humans cannot intuitively understand why an RM thinks a response is good or not. As RMs act as human preference proxies, we believe they should be human-interpretable to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking in LLM alignment. To build RMs with interpretable preferences, we propose a two-stage approach: i) train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) with multi-dimensional absolute-rating data, each dimension corresponding to a human-interpretable objective (e.g., honesty, verbosity, safety); ii) employ a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives based on the context. We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM. Our trained model, ArmoRM-Llama3-8B, obtains state-of-the-art performance on RewardBench, a benchmark evaluating RMs for language modeling. Notably, the performance of our model surpasses the LLM-as-a-judge method with GPT-4 judges by a margin, and approaches the performance of the much larger Nemotron-4 340B reward model.
Y. He, H. Wang, Z. Jiang, A. Papangelis, H. Zhao
In Proceedings of the Association for Computational Linguistics: EMNLP 2024 (EMNLP 2024 Findings)
[abs] [pdf] [code]
Reward models (RM) capture the values and preferences of humans and play a central role in Reinforcement Learning with Human Feedback (RLHF) to align pretrained large language models (LLMs). Traditionally, training these models relies on extensive human-annotated preference data, which poses significant challenges in terms of scalability and cost. To overcome these limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. Given an unlabeled dataset, SSRM involves three key iterative steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a confidence threshold, and supervised finetuning on the refined dataset. Across extensive experiments on various model configurations, we demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Notably, SSRM can achieve performance comparable to models trained entirely on labeled data of equivalent volumes. Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
Y. Lin, H. Lin, W. Xiong, S. Diao, J. Liu, J. Zhang, R. Pan, H. Wang, W. Hu, H. Zhang, H. Dong, R. Pi, H. Zhao, N. Jiang, H. Ji, Y. Yao, T. Zhang
In Proceedings of the Association for Computational Linguistics: EMNLP 2024 (EMNLP 2024)
[abs] [pdf] [code]
LLMs acquire a wide range of abilities during pre-training, but aligning LLMs under Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting pretrained abilities, which is also known as the alignment tax. To investigate alignment tax, we conducted experiments with existing RLHF algorithms using OpenLLaMA-3B, which revealed a pronounced alignment tax in NLP tasks. Whereas, despite various techniques to mitigate forgetting, they are often at odds with the RLHF performance, leading to a trade-off between alignment performance and forgetting mitigation, leading to an alignment-forgetting trade-off. In this paper we show that model averaging, which simply interpolates between pre and post RLHF model weights, surprisingly achieves the most strongest alignment-forgetting Pareto front among a wide range of competing methods. To understand its effectiveness, we offer theoretical insights into model averaging, revealing that it enhances performance Pareto front by increasing feature diversity on the layers where tasks share overlapped feature spaces. Empirical evidence corroborates our analysis by showing the benefits of averaging low-level transformer layers. Building on the analysis and the observation that averaging different layers of the transformer leads to significantly different alignment-forgetting trade-offs, we propose Heterogeneous Model Averaging (HMA) to Heterogeneously find various combination ratios of model layers. HMA seeks to maximize the alignment performance while incurring minimal alignment tax. Moreover, we validate HMA's performance across a range of RLHF algorithms over OpenLLaMA-3B and further extend our findings to Mistral-7B which is evaluated by open-sourced preference model and GPT4.
R. Xian, H. Zhao
arXiv preprint
[abs] [pdf] [code]
We present a post-processing algorithm for fair classification that covers group fairness criteria including statistical parity, equal opportunity, and equalized odds under a single framework, and is applicable to multiclass problems in both attribute-aware and attribute-blind settings. Our algorithm, called "LinearPost", achieves fairness post-hoc by linearly transforming the predictions of the (unfair) base predictor with a "fairness risk" according to a weighted combination of the (predicted) group memberships. It yields the Bayes optimal fair classifier if the base predictors being post-processed are Bayes optimal, otherwise, the resulting classifier may not be optimal, but fairness is guaranteed as long as the group membership predictor is multicalibrated. The parameters of the post-processing can be efficiently computed and estimated from solving an empirical linear program. Empirical evaluations demonstrate the advantage of our algorithm in the high fairness regime compared to existing post-processing and in-processing fair classification algorithms.
H. Zhao
AI Magazine (an overview of our group's work on algorithmic fairness and more broadly, trustworthy machine learning)
[abs] [link]
With the development of machine learning algorithms and the increasing computational resources available, artificial intelligence has achieved great success in many application domains. However, the success of machine learning has also raised concerns about the fairness of the learned models. For instance, the learned models can perpetuate and even exacerbate the potential bias and discrimination in the training data. This issue has become a major obstacle to the deployment of machine learning systems in high-stakes domains, for example, criminal judgment, medical testing, online advertising, hiring process, and so forth. To mitigate the potential bias exhibited by machine learning models, fairness criteria can be integrated into the training process to ensure fair treatment across all demographics, but it often comes at the expense of model performance. Understanding such tradeoffs, therefore, is crucial to the design of optimal and fair algorithms. My research focuses on characterizing the inherent tradeoff between fairness and accuracy in machine learning, and developing algorithms that can achieve both fairness and optimality. In this article, I will discuss our recent work on designing post-processing algorithms for fair classification, which can be applied to a wide range of fairness criteria, including statistical parity, equal opportunity, and equalized odds, under both attribute-aware and attribute-blind settings, and is particularly suited to large-scale foundation models where retraining is expensive or even infeasible. I will also discuss the connections between our work and other related research on trustworthy machine learning, including the connections between algorithmic fairness and differential privacy as well as adversarial robustness.
M. Yamada, Y. Takezawa, G. Houry, K. Düsterwald, D. Sulem, H. Zhao, Y. H. Tsai
Entropy (Entropy 2024)
[abs] [arXiv] [Entropy]
In this study, we delve into the problem of self-supervised learning (SSL) utilizing the 1-Wasserstein distance on a tree structure (a.k.a., Tree-Wasserstein distance (TWD)), where TWD is defined as the L1 distance between two tree-embedded vectors. In SSL methods, the cosine similarity is often utilized as an objective function; however, it has not been well studied when utilizing the Wasserstein distance. Training the Wasserstein distance is numerically challenging. Thus, this study empirically investigates a strategy for optimizing the SSL with the Wasserstein distance and finds a stable training procedure. More specifically, we evaluate the combination of two types of TWD (total variation and ClusterTree) and several probability models, including the softmax function, the ArcFace probability model, and simplicial embedding. We propose a simple yet effective Jeffrey divergence-based regularization method to stabilize optimization. Through empirical experiments on STL10, CIFAR10, CIFAR100, and SVHN, we find that a simple combination of the softmax function and TWD can obtain significantly lower results than the standard SimCLR. Moreover, a simple combination of TWD and SimSiam fails to train the model. We find that the model performance depends on the combination of TWD and probability model, and that the Jeffrey divergence regularization helps in model training. Finally, we show that the appropriate combination of the TWD and probability model outperforms cosine similarity-based representation learning.
Y. He, H. Wang, B. Li, H. Zhao
Journal of Machine Learning Research (JMLR 2024)
(Extended version of our ICML 2022 paper under title "Understanding Gradual Domain Adaptation: Improved Analysis, Optimal Path and Beyond")
[abs] [arXiv] [JMLR] [code]
Unsupervised domain adaptation (UDA) adapts a model from a labeled source domain to an unlabeled target domain in a one-off way. Though widely applied, UDA faces a great challenge whenever the distribution shift between the source and the target is large. Gradual domain adaptation (GDA) mitigates this limitation by using intermediate domains to gradually adapt from the source to the target domain. In this work, we first theoretically analyze gradual self-training, a popular GDA algorithm, and provide a significantly improved generalization bound compared with Kumar et al. (2020). Our theoretical analysis leads to an interesting insight: to minimize the generalization error on the target domain, the sequence of intermediate domains should be placed uniformly along the Wasserstein geodesic between the source and target domains. The insight is particularly useful under the situation where intermediate domains are missing or scarce, which is often the case in real-world applications. Based on the insight, we propose Generative Gradual DOmain Adaptation with Optimal Transport (GOAT), an algorithmic framework that can generate intermediate domains in a data-dependent way. More concretely, we first generate intermediate domains along the Wasserstein geodesic between two given consecutive domains in a feature space, then apply gradual self-training to adapt the source-trained classifier to the target along the sequence of intermediate domains. Empirically, we demonstrate that our GOAT framework can improve the performance of standard GDA when the given intermediate domains are scarce, significantly broadening the real-world application scenarios of GDA. Our code is available at https://github.com/yifei-he/GOAT.
Y. He, R. Cheng, G. Balasubramaniam, Y. H. Tsai, and H. Zhao
Journal of Machine Learning Research (JMLR 2024)
[abs] [pdf]
Multimodal learning aims to learn from data of different modalities by fusing information from heterogeneous sources. Although it is beneficial to learn from more modalities, it is often infeasible to use all available modalities under limited computational resources. Modeling with all available modalities can also be inefficient and unnecessary when information across input modalities overlaps. In this paper, we study the modality selection problem, which aims to select the most useful subset of modalities for learning under a cardinality constraint. To that end, we propose a unified theoretical framework to quantify the learning utility of modalities, and we identify dependence assumptions to flexibly model the heterogeneous nature of multimodal data, which also allows efficient algorithm design. Accordingly, we derive a greedy modality selection algorithm via submodular maximization, which selects the most useful modalities with an optimality guarantee on learning performance. We also connect marginal-contribution-based feature importance scores, such as Shapley value, from the feature selection domain to the context of modality selection, to efficiently compute the importance of individual modality. We demonstrate the efficacy of our theoretical results and modality selection algorithms on 2 synthetic and 4 real-world data sets on a diverse range of multimodal data.
H. Wang, Y. Lin, W. Xiong, R. Yang, S. Diao, S. Qiu, H. Zhao, T. Zhang
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
[abs] [arXiv] [code] [poster]
Fine-grained control over large language models (LLMs) remains a significant challenge, hindering their adaptability to diverse user needs. While Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning LLMs, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. To address this limitation, we introduce the Directional Preference Alignment (DPA) framework. Unlike the scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles. Additionally, DPA models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control. Our method involves training a multi-objective reward model and then fine-tuning the LLM with a preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF method adopted by Llama 2. This method enjoys a better performance trade-off across various reward objectives. In comparison with the scalar-reward RLHF, DPA offers users intuitive control over LLM generation: they can arithmetically specify their desired trade-offs (e.g., more helpfulness with less verbosity). We also validate the effectiveness of DPA with real-world alignment experiments on Mistral-7B. Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity while maintaining competitive performance with strong baselines such as Direct Preference Optimization (DPO).
R. Xian, Q. Li, G. Kamath, H. Zhao
In Proceedings of the 41st International Conference on Machine Learning (ICML 2024)
[abs] [pdf] [code]
This paper describes a differentially private post-processing algorithm for learning fair regressors satisfying statistical parity, addressing privacy concerns of machine learning models trained on sensitive data, as well as fairness concerns of their potential to propagate historical biases. Our algorithm can be applied to post-process any given regressor to improve fairness by remapping its outputs. It consists of three steps: first, the output distributions are estimated privately via histogram density estimation and the Laplace mechanism, then their Wasserstein barycenter is computed, and the optimal transports to the barycenter are used for post-processing to satisfy fairness. We analyze the sample complexity of our algorithm and provide fairness guarantee, revealing a trade-off between the statistical bias and variance induced from the choice of the number of bins in the histogram, in which using less bins always favors fairness at the expense of error.
S. Liu, D. Zou, H. Zhao, P. Li
In Proceedings of the 41st International Conference on Machine Learning (ICML 2024, spotlight)
[abs] [pdf]
Graph-based methods, pivotal for label inference over interconnected objects in many real-world applications, often encounter generalization challenges, if the graph used for model training differs significantly from the graph used for testing. This work delves into Graph Domain Adaptation (GDA) to address the unique complexities of distribution shifts over graph data, where interconnected data points experience shifts in features, labels, and in particular, connecting patterns. We propose a novel, theoretically principled method, Pairwise Alignment (Pair-Align) to counter graph structure shift by mitigating conditional structure shift (CSS) and label shift (LS). Pair-Align uses edge weights to recalibrate the influence among neighboring nodes to handle CSS and adjusts the classification loss with label weights to handle LS. Our method demonstrates superior performance in real-world applications, including node classification with region shift in social networks, and the pileup mitigation task in particle colliding experiments. For the first application, we also curate the largest dataset by far for GDA studies. Our method shows strong performance in synthetic and other existing benchmark datasets.
Y. He, S. Zhou, G. Zhang, H. Yun, Y. Xu, B. Zeng, T. Chilimbi, H. Zhao
In Proceedings of the 41st International Conference on Machine Learning (ICML 2024)
[abs] [pdf]
Multi-task learning (MTL) considers learning a joint model for multiple tasks by optimizing a convex combination of all task losses. To solve the optimization problem, existing methods use an adaptive weight updating scheme, where task weights are dynamically adjusted based on their respective losses to prioritize difficult tasks. However, these algorithms face a great challenge whenever label noise is present, in which case excessive weights tend to be assigned to noisy tasks that have relatively large Bayes optimal errors, thereby overshadowing other tasks and causing performance to drop across the board. To overcome this limitation, we propose Multi-Task Learning with Excess Risks (ExcessMTL), an excess risk-based task balancing method that updates the task weights by their distances to convergence instead. Intuitively, ExcessMTL assigns higher weights to worse-trained tasks that are further from convergence. To estimate the excess risks, we develop an efficient and accurate method with Taylor approximation. Theoretically, we show that our proposed algorithm achieves convergence guarantees and Pareto stationarity. Empirically, we evaluate our algorithm on various MTL benchmarks and demonstrate its superior performance over existing methods in the presence of label noise.
Y. Yang, M. Lin, H. Zhao, Y. Peng, F. Huang, Z. Lu
Journal of Biomedical Informatics (JBI 2024)
[abs] [pdf] [arXiv]
Artificial intelligence (AI) systems have the potential to revolutionize clinical practices, including improving diagnostic accuracy and surgical decision-making, while also reducing costs and manpower. However, it is important to recognize that these systems may perpetuate social inequities or demonstrate biases, such as those based on race or gender. Such biases can occur before, during, or after the development of AI models, making it critical to understand and address potential biases to enable the accurate and reliable application of AI models in clinical settings. To mitigate bias concerns during model development, we surveyed recent publications on different debiasing methods in the fields of biomedical natural language processing (NLP) or computer vision (CV). Then we discussed the methods that have been applied in the biomedical domain to address bias. We performed our literature search on PubMed, ACM digital library, and IEEE Xplore of relevant articles published between January 2018 and December 2023 using multiple combinations of keywords. We then filtered the result of 10,041 articles automatically with loose constraints, and manually inspected the abstracts of the remaining 890 articles to identify the 55 articles included in this review. Additional articles in the references are also included in this review. We discuss each method and compare its strengths and weaknesses. Finally, we review other potential methods from the general domain that could be applied to biomedicine to address bias and improve fairness. The bias of AIs in biomedicine can originate from multiple sources. Existing debiasing methods that focus on algorithms can be categorized into distributional or algorithmic.
Z. Gong, B. Usman, H. Zhao and D. I. Inouye
In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024)
[abs] [pdf]
Distribution alignment can be used to learn invariant representations with applications in fairness and robustness. Most prior works resort to adversarial alignment methods but the resulting minimax problems are unstable and challenging to optimize. Non-adversarial likelihood-based approaches either require model invertibility, impose constraints on the latent prior, or lack a generic framework for alignment. To overcome these limitations, we propose a non-adversarial VAE-based alignment method that can be applied to any model pipeline. We develop a set of alignment upper bounds (including a noisy bound) that have VAE-like objectives but with a different perspective. We carefully compare our method to prior VAE-based alignment approaches both theoretically and empirically. Finally, we demonstrate that our novel alignment losses can replace adversarial losses in standard invariant representation learning pipelines without modifying the original architectures -- thereby significantly broadening the applicability of non-adversarial alignment methods.
G. Houry, H. Bao, H. Zhao and M. Yamada
In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024)
[abs] [pdf] [poster]
Among numerous linear approximation methods proposed for optimal transport (OT), tree-based methods appear to be fairly reliable, notably for language processing applications. Inspired by these tree methods, we introduce several greedy heuristics aiming to compute even faster approximations of OT. We first explicitly establish the equivalence between greedy matching and optimal transport for tree metrics, and then we show that tree greedy matching can be reduced to greedy matching on a one-dimensional line. Next, we propose two new greedy-based algorithms in one dimension: the $k$-Greedy and 1D-ICT algorithms. This novel approach provides Wasserstein approximations with accuracy similar to the original tree methods on text datasets while being faster in practice. Finally, these algorithms are applicable beyond tree approximations: using sliced projections of the original data still provides fairly good accuracy while eliminating the need for embedding the data in a fixed and rigid tree structure. This property makes these approaches even more versatile than the original tree OT methods.
X. Han, J. Chi, Y. Chen, Q. Wang, H. Zhao, N. Zou, X. Hu
In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024)
[abs] [pdf]
This paper introduces the Fair Fairness Benchmark (\textsf{FFB}), a benchmarking framework for in-processing group fairness methods. Ensuring fairness in machine learning is critical for ethical and legal compliance. However, there exist challenges in comparing and developing of fairness methods due to inconsistencies in experimental settings, lack of accessible algorithmic implementations, and limited extensibility of current fairness packages and tools. To address these issues, we introduce an open-source, standardized benchmark for evaluating in-processing group fairness methods and provide a comprehensive analysis of state-of-the-art methods to ensure different notions of group fairness. This work offers the following key contributions: the provision of flexible, extensible, minimalistic, and research-oriented open-source code; the establishment of unified fairness method benchmarking pipelines; and extensive benchmarking, which yields key insights from 45,079 experiments. We believe our work will significantly facilitate the growth and development of the fairness research community.
H. Dong, W. Xiong, B. Pang, H. Wang, H. Zhao, Y. Zhou, N. Jiang, D. Sahoo, C. Xiong, T. Zhang
Transactions on Machine Learning Research (TMLR 2024)
[abs] [pdf] [code]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, \texttt{SFR-Iterative-DPO-LLaMA-3-8B-R}, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to \url{https://github.com/RLHFlow/RLHF-Reward-Modeling} and \url{https://github.com/RLHFlow/Online-RLHF} for more detailed information.
H. Wang, H. Si, H. Shao, H. Zhao
Transactions on Machine Learning Research (TMLR 2024)
[abs] [pdf]
Real-world applications of machine learning models often confront data distribution shifts, wherein discrepancies exist between the training and test data distributions. In the common multi-domain multi-class setup, as the number of classes and domains scales up, it becomes infeasible to gather training data for every domain-class combination. This challenge naturally leads the quest for models with Compositional Generalization (CG) ability, where models can generalize to unseen domain-class combinations. To delve into the CG challenge, we develop CG-Bench, a suite of CG benchmarks derived from existing real-world image datasets, and observe that the prevalent pretraining-finetuning paradigm on foundational models, such as CLIP and DINOv2, struggles with the challenge. To address this challenge, we propose Compositional Feature Alignment (CFA), a simple two-stage finetuning technique that i) learns two orthogonal linear heads on a pretrained encoder with respect to class and domain labels, and ii) fine-tunes the encoder with the newly learned head frozen. We theoretically and empirically justify that CFA encourages compositional feature learning of pretrained models. We further conduct extensive experiments on CG-Bench for CLIP and DINOv2, two powerful pretrained vision foundation models. Experiment results show that CFA outperforms common finetuning techniques in compositional generalization, corroborating CFA's efficacy in compositional feature learning.
V. Duong, Q. Wu, Z. Zhou, E. Zavesky, W-L. Hsu, H. Zhao, H. Shao
Transactions on Machine Learning Research (TMLR 2024)
[abs] [pdf]
Out-of-distribution (OOD) detection seeks to identify test samples that deviate from the training data, which is critical to ensuring the safety and reliability of machine learning (ML) systems. While a plethora of methods have been developed to detect uni-modal OOD samples, only a few have focused on multi-modal OOD detection. Current contrastive learning-based methods primarily address multi-modal OOD detection in a scenario where an image is not related to the class labels in training data. However, ML systems in the real-world applications may encounter a broader spectrum of anomalies caused by different factors like systematic errors in labeling, environmental changes, and sensor malfunctions. Hence, we propose a new method to be able to simultaneously detect anomalies from multiple different OOD scenarios, arising from fine-grained image features and textual descriptions, instead of large categorical information. To achieve this goal, we propose a general-purpose weakly-supervised OOD detection framework, called WOOD, that combines a binary classifier and a contrastive learning module to reap the benefits of both. In order to better distinguish in-distribution (ID) samples from OOD ones, we employ the Hinge loss to constrain the similarity of their latent representations. Moreover, we devise a new scoring metric that fuses predictions from both the binary classifier and contrastive learning to enhance OOD detection. Extensive experimental results on multiple benchmarks demonstrate that the proposed WOOD significantly outperforms the state-of-the-art methods for multi-modal OOD detection. Importantly, our approach can achieve superior detection performance in a variety of OOD scenarios.
X. Wang, H. Zhao, K. Nahrstedt, S. Koyejo
Transactions on Machine Learning Research (TMLR 2024)
[abs] [pdf]
One of the common approaches for personalizing federated learning is fine-tuning the global model for each local client. While this addresses some issues of statistical heterogeneity, we find that such personalization methods are vulnerable to spurious features at local agents, leading to reduced generalization performance. This work considers a setup where spurious features correlate with the label in each client's training environment, and the mixture of multiple training environments (i.e., the global environment) diminishes the spurious correlations. In other words, while the global federated learning model trained over the global environment suffers less from spurious features, the local fine-tuning step may lead to personalized models vulnerable to spurious correlations. In light of this practical and pressing challenge, we propose a novel strategy to mitigate the effect of spurious features during personalization by maintaining the adversarial transferability between the global and personalized models. Empirical results on object and action recognition tasks show that our proposed approach bounds personalized models from further exploiting spurious features while preserving the benefit of enhanced accuracy from fine-tuning.
Y. Hu, R. Xian, Q. Wu, Q. Fan, L. Yin, and H. Zhao
In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS 2023)
[abs] [pdf] [poster] [slides] [video]
Linear scalarization, i.e., combining all loss functions by a weighted sum, has been the default choice in the literature of multi-task learning (MTL) since its inception. In recent years, there is a surge of interest in developing Specialized Multi-Task Optimizers (SMTOs) that treat MTL as a multi-objective optimization problem. However, it remains open whether there is a fundamental advantage of SMTOs over scalarization. In fact, heated debates exist in the community comparing these two types of algorithms, mostly from an empirical perspective. To approach the above question, in this paper, we revisit scalarization from a theoretical perspective. We focus on linear MTL models and study whether scalarization is capable of fully exploring the Pareto front. Our findings reveal that, in contrast to recent works that claimed empirical advantages of scalarization, scalarization is inherently incapable of full exploration, especially for those Pareto optimal solutions that strike the balanced trade-offs between multiple tasks. More concretely, when the model is under-parametrized, we reveal a multi-surface structure of the feasible region and identify necessary and sufficient conditions for full exploration. This leads to the conclusion that scalarization is in general incapable of tracing out the Pareto front. Our theoretical results partially answer the open questions in Xin et al. (2021), and provide a more intuitive explanation on why scalarization fails beyond non-convexity. We additionally perform experiments on a real-world dataset using both scalarization and state-of-the-art SMTOs. The experimental results not only corroborate our theoretical findings, but also unveil the potential of SMTOs in finding balanced solutions, which cannot be achieved by scalarization.
S. Shin, I. Shomorony, and H. Zhao
In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS 2023)
[abs] [pdf] [poster]
Graph Neural Networks (GNNs) are a powerful class of machine learning models with applications in recommender systems, drug discovery, social network analysis, and computer vision. One challenge with their implementation is that GNNs often take large-scale graphs as inputs, which imposes significant computational/storage costs in the training and testing phases. In particular, the message passing operations of a GNN require multiplication of the graph adjacency matrix A ∈\R^n \times n and the data matrix X ∈\R^n \times d, and the O(n^2 d) time complexity can be prohibitive for large n. Thus, a natural question is whether it is possible to perform the GNN operations in (quasi-)linear time by avoiding the full computation of A X. To study this question, we consider the setting of a regression task on a two-layer Linear Graph Convolutional Network (GCN). We develop an efficient training algorithm based on (1) performing node subsampling, (2) estimating the leverage scores of A X based on the subsampled graph, and (3) performing leverage score sampling on A X. We show that our proposed scheme learns the regression model observing only O(nd\eps^-2\log n) entries of A in time O(nd^2 \eps^-2\log n), with the guarantee that the learned weights deviate by at most εunder the \ell_2 norm from the model learned using the entire adjacency matrix A. We present empirical results for regression problems on two real-world graphs and show that our algorithm significantly outperforms other baseline sampling strategies that exploit the same number of observations.
R. Xian, H. Zhuang, Z. Qin, H. Zamani, J. Lu, J. Ma, K. Hui, H. Zhao, X. Wang, M. Bendersky
In Proceedings of the 37th Advances in Neural Information Processing Systems (NeurIPS 2023, spotlight)
[abs] [pdf] [poster]
Domain adaptation aims to transfer the knowledge acquired by models trained on (data-rich) source domains to (low-resource) target domains, for which a popular method is invariant representation learning. While they have been studied extensively for classification and regression problems, how they apply to ranking problems, where the data and metrics have a list structure, is not well understood. Theoretically, we establish a domain adaptation generalization bound for ranking under listwise metrics such as MRR and NDCG. The bound suggests an adaptation method via learning list-level domain-invariant feature representations, whose benefits are empirically demonstrated by unsupervised domain adaptation experiments on real-world ranking tasks, including passage reranking. A key message is that for domain adaptation, the representations should be analyzed at the same level at which the metric is computed, as we show that learning invariant representations at the list level is most effective for adaptation on ranking problems.
J. Shen, H. Lai, M. Liu, H. Zhao, Y. Yu, and W. Zhang
Journal of Machine Learning Research (JMLR 2023)
(Extended version of our NeurIPS 2020 paper under title "Model-based Policy Optimization with Unsupervised Model Adaptation")
[abs] [pdf]
Compared to model-free reinforcement learning (RL), model-based RL is often more sample efficient by leveraging a learned dynamics model to help decision-making. However, the learned model is usually not perfectly accurate and the error will compound in multi-step predictions, which can lead to poor asymptotic performance. In this paper, we first derive an upper bound of the return discrepancy between the real dynamics and the learned model, which reveals the fundamental problem of distribution shift between simulated data and real data. Inspired by the theoretical analysis, we propose an adaptation augmented model-based policy optimization (AMPO) framework to address the distribution shift problem from the perspectives of feature learning and instance re-weighting, respectively. Specifically, the feature-based variant, namely FAMPO, introduces unsupervised model adaptation to minimize the integral probability metric (IPM) between feature distributions from real and simulated data, while the instance-based variant, termed as IAMPO, utilizes importance sampling to re-weight the real samples used to train the model. Besides model learning, we also investigate how to improve policy optimization in the model usage phase by selecting simulated samples with different probabilities according to their uncertainty. Extensive experiments on challenging continuous control tasks show that FAMPO and IAMPO, coupled with our model usage technique, achieve superior performance against baselines, which demonstrates the effectiveness of the proposed methods.
C. Mavromatis, V. N. Ioannidis, S. Wang, D. Zheng, S. Adeshina, J. Ma, H. Zhao, C. Faloutsos, G. Karypis
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2023)
[abs] [pdf]
How can we learn effective node representations on textual graphs? Graph Neural Networks (GNNs) that use Language Models (LMs) to encode textual information of graphs achieve state-of-the-art performance in many node classification tasks. Yet, combining GNNs with LMs has not been widely explored for practical deployments due to its scalability issues. In this work, we tackle this challenge by developing a Graph-Aware Distillation framework (GRAD) to encode graph structures into an LM for graph-free, fast inference. Different from conventional knowledge distillation, GRAD jointly optimizes a GNN teacher and a graph-free student over the graph's nodes via a shared LM. This encourages the graph-free student to exploit graph information encoded by the GNN teacher while at the same time, enables the GNN teacher to better leverage textual information from unlabeled nodes. As a result, the teacher and the student models learn from each other to improve their overall performance. Experiments in eight node classification benchmarks in both transductive and inductive settings showcase GRAD's superiority over existing distillation approaches for textual graphs.
R. Xian, L. Yin, H. Zhao
In Proceedings of the 40th International Conference on Machine Learning (ICML 2023)
[abs] [pdf] [code] [slides] [video] [AI Magazine]
Fairness in automated decision-making systems has gained increasing attention as their applications expand to real-world high-stakes domains. To facilitate the design of fair ML systems, it is essential to understand the potential trade-offs between fairness and predictive power, and the construction of the optimal predictor under a given fairness constraint. In this paper, for general classification problems under the group fairness criterion of demographic parity (DP), we precisely characterize the trade-off between DP and classification accuracy, referred to as the minimum cost of fairness. Our insight comes from the key observation that finding the optimal fair classifier is equivalent to solving a Wasserstein-barycenter problem under $\ell_1$-norm restricted to the vertices of the probability simplex. Inspired by our characterization, we provide a construction of an optimal fair classifier achieving this minimum cost via the composition of the Bayes regressor and optimal transports from its output distributions to the barycenter. Our construction naturally leads to an algorithm for post-processing any pre-trained predictor to satisfy DP fairness, complemented with finite sample guarantees. Experiments on real-world datasets verify and demonstrate the effectiveness of our approaches.
Y. Hu, F. Wu, H. Zhang, and H. Zhao
In Proceedings of the 40th International Conference on Machine Learning (ICML 2023)
[abs] [pdf]
While it has long been empirically observed that adversarial robustness may be at odds with standard accuracy and may have further disparate impacts on different classes, it remains an open question to what extent such observations hold and how the class imbalance plays a role within. In this paper, we attempt to understand this question of accuracy disparity by taking a closer look at linear classifiers under a Gaussian mixture model. We decompose the impact of adversarial robustness into two parts: an inherent effect that will degrade the standard accuracy on all classes, and the other caused by the class imbalance ratio, which will increase the accuracy disparity compared to standard training. Furthermore, we also extend our model to the general family of stable distributions. We demonstrate that while the constraint of adversarial robustness consistently degrades the standard accuracy in the balanced class setting, the class imbalance ratio plays a fundamentally different role in accuracy disparity compared to the Gaussian case, due to the heavy tail of the stable distribution. We additionally perform experiments on both synthetic and real-world datasets. The empirical results not only corroborate our theoretical findings, but also suggest that the implications may extend to nonlinear models over real-world datasets.
S. Liu, T. Li, Y. Feng, N. Tran, H. Zhao, Q. Qiu, and Pan Li
In Proceedings of the 40th International Conference on Machine Learning (ICML 2023)
[abs] [pdf]
In many real-world applications, graph-structured data used for training and testing have differences in distribution, such as in high energy physics (HEP) where simulation data used for training may not match real experiments. Graph domain adaptation (GDA) is a method used to address these differences. However, current GDA primarily works by aligning the distributions of node representations output by a single graph neural network encoder shared across the two domains, which may often yield sub-optimal solutions. This work examines different impacts of distribution shifts caused by either graph structure or node attributes and identifies a new type of shift, named conditional structure shift, which current GDA approaches are provably sub-optimal to deal with. A novel approach, called structural reweighting (StruRW), is proposed to address this issue and is tested on synthetic graphs, four benchmark datasets, and a new application in HEP. StruRW has shown significant performance improvement over the baselines in the settings with large graph structure shifts, and reasonable performance improvement when node attribute shifts dominate.
Q. Jiang, C. Chen, H. Zhao, L. Chen, Q. Ping, S. Dinh Tran, Y. Xu, B. Zeng, T. Chilimbi
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)
[abs] [pdf]
Contrastive loss has been increasingly used in learning representations from multiple modalities. In the limit, the nature of the contrastive loss encourages modalities to exactly match each other in the latent space. Yet it remains an open question how the modality alignment affects the downstream task performance. In this paper, based on an information-theoretic argument, we first prove that exact modality alignment is sub-optimal in general for downstream prediction tasks. Hence we advocate that the key of better performance lies in meaningful latent modality structures instead of perfect modality alignment. To this end, we propose three general approaches to construct latent modality structures. Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization. Extensive experiments are conducted on two popular multi-modal representation learning frameworks: the CLIP-based two-tower model and the ALBEF-based fusion model. We test our model on a variety of tasks including zero/few-shot image classification, image-text retrieval, visual question answering, visual reasoning, and visual entailment. Our method achieves consistent improvements over existing methods, demonstrating the effectiveness and generalizability of our proposed approach on latent modality structure regularization.
H. Zhao
Transactions on Machine Learning Research (TMLR 2023)
[abs] [pdf]
Real-world applications of machine learning tools in high-stakes domains are often regulated to be fair, in the sense that the predicted target should satisfy some quantitative notion of parity with respect to a protected attribute. However, the exact tradeoff between fairness and accuracy with a real-valued target is not entirely clear. In this paper, we characterize the inherent tradeoff between statistical parity and accuracy in the regression setting by providing a lower bound on the error of any attribute-blind fair regressor. Our lower bound is sharp, algorithm-independent, and admits a simple interpretation: when the moments of the target differ between groups, any fair algorithm has to make an error on at least one of the groups. We further extend this result to give a lower bound on the joint error of any (approximately) fair algorithm, using the Wasserstein distance to measure the quality of the approximation. With our novel lower bound, we also show that the price paid by a fair regressor that does not take the protected attribute as input is less than that of a fair regressor with explicit access to the protected attribute. On the upside, we establish the first connection between individual fairness, accuracy parity, and the Wasserstein distance by showing that if a regressor is individually fair, it also approximately verifies the accuracy parity, where the gap is again given by the Wasserstein distance between the two groups. Inspired by our theoretical results, we develop a practical algorithm for fair regression through the lens of representation learning, and conduct experiments on a real-world dataset to corroborate our findings.
S. Zeng, R. des Combes, H. Zhao
In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023)
[abs] [pdf] [code]
Existing models for learning representations in supervised classification problems are permutation invariant with respect to class labels. However, structured knowledge about the classes, such as hierarchical label structures, widely exists in many real-world datasets, e.g., the ImageNet and CIFAR benchmarks. How to learn representations that can preserve such structures among the classes remains an open problem. To approach this problem, given a tree of class hierarchy, we first define a tree metric between any pair of nodes in the tree to be the length of the shortest path connecting them. We then provide a method to learn the hierarchical relationship of class labels by approximately embedding the tree metric in the Euclidean space of features. More concretely, during supervised training, we propose to use the Cophenetic Correlation Coefficient (CPCC) as a regularizer for the cross-entropy loss to correlate the tree metric of classes and the Euclidean distance in the class-conditioned representations. Our proposed regularizer is computationally lightweight and easy to implement. Empirically, we demonstrate that this approach can help to learn more interpretable representations due to the preservation of the tree metric, and leads to better generalization in-distribution as well as under sub-population shifts over multiple datasets.
S. Shin, H. Zhao, I. Shomorony
In Proceedings of the 34th International Conference on Algorithmic Learning Theory (ALT 2023)
[abs] [pdf]
Computing the dominant eigenvectors of a matrix $A$ has many applications, such as principal component analysis, spectral embedding, and PageRank. However, in general, this task relies on the complete knowledge of the matrix $A$, which can be too large to store or even infeasible to observe in many applications, e.g., large-scale social networks. Thus, a natural question is how to accurately estimate the eigenvectors of $A$ when only partial observations can be made by sampling entries from $A$. To this end, we propose the Adaptive Power Method (\apmnospace), a variant of the well-known power method. At each power iteration, \apm adaptively selects a subset of the entries of $A$ to observe based on the current estimate of the top eigenvector. We show that \apm can estimate the dominant eigenvector(s) of $A$ with squared error at most $\epsilon$ by observing roughly $O(n\eps^{-2} \log^2 (n/\eps))$ entries of an $n\times n$ matrix. We present empirical results for the problem of eigenvector centrality computation on two real-world graphs and show that \apm significantly outperforms a non-adaptive estimation algorithm using the same number of observations. Furthermore, in the context of eigenvector centrality, \apm can also adaptively allocate the observation budget to selectively refine the estimate of nodes with high centrality scores in the graph.
Y. Shen, J. Du, H. Zhao, Z. Ji, C. Ma, M. Gao
In Proceedings of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023)
[abs] [pdf]
Federated adversary domain adaptation is a unique distributed minimax training task due to the heterogeneous data among different local clients, where each client only sees a subset of the data that merely belongs to either the source or target domain. Despite the extensive research in distributed minimax optimization, existing communication efficient solvers that exploit multiple steps of the local update are still not able to generate satisfactory solutions for federated adversarial domain adaptation because of the gradient divergence issue among clients. To tackle this problem, we propose a distributed minimax optimizer, referred to as FedMM, by introducing dual variables to bridge the gradient gap among clients. This algorithm is effective even in the extreme case where each client has different label classes and some clients only have unlabeled data. We prove that FedMM admits benign convergence to a stationary point under domain-shifted unlabeled data. On a variety of benchmark datasets, extensive experiments show that FedMM consistently achieves both better communication savings and significant accuracy improvements over existing federated optimizers based on the stochastic gradient descent ascent (SGDA) algorithm. When training from scratch, for example, it outperforms other SGDA based federated average methods by around 20% in accuracy over the same communication rounds; and it consistently outperforms when training from pre-trained models.
G. Balasubramaniam, H. Wang, H. Zhao
NeurIPS Distribution Shifts (DistShift) Workshop, 2022 (NeurIPS 2022)
[abs] [pdf]
Domain generalization aims to learn a model over multiple training environments to generalize to unseen environments. Recently, \cite{wang2022provable} proposed Invariant-feature Subspace Recovery (ISR), a domain generalization algorithm that uses the means of class-conditional data distributions to provably identify the invariant-feature subspace under a given causal model. However, due to the specific assumptions of the causal model, the original ISR algorithm is conditioned on a single class only, without utilizing information from the rest of the classes. In this work, we consider the setting of multi-class classification under a more general causal model, and propose an extension of the ISR algorithm, called ISR-Multiclass. This proposed algorithm can provably recover the invariant-feature subspace with $\lceil d_{spu}/k \rceil + 1$ environments, where $d_{spu}$ is the number of spurious features and $k$ is the number of classes. Empirically, we first examine ISR-Multiclass in a synthetic dataset, and demonstrate its superiority over the original ISR in the multi-class setting. Furthermore, we conduct experiments in Multiclass Coloured MNIST, a semi-synthetic dataset with strong spurious correlations, and show that ISR-Multiclass can significantly improve the robustness of neural nets trained by various methods (e.g., ERM and IRM) against spurious correlations.
J. Dong, S. Zhou, B. Wang, H. Zhao
Transactions on Machine Learning Research (TMLR 2022)
[abs] [pdf]
The phenomenon of data distribution evolving over time has been observed in a range of applications, calling for the need for adaptive learning algorithms. We thus study the problem of supervised gradual domain adaptation, where labeled data from shifting distributions are available to the learner along the trajectory, and we aim to learn a classifier on a target data distribution of interest. Under this setting, we provide the first generalization upper bound on the learning error under mild assumptions. Our results are algorithm agnostic, general for a range of loss functions, and only depend linearly on the averaged learning error across the trajectory. This shows significant improvement compared to the previous upper bound for unsupervised gradual domain adaptation, where the learning error on the target domain depends exponentially on the initial error on the source domain. Compared with the offline setting of learning from multiple domains, our results also suggest the potential benefits of the temporal structure among different domains in adapting to the target one. Empirically, our theoretical results imply that learning proper representations across the domains will effectively mitigate learning errors. Motivated by these theoretical insights, we propose a min-max learning objective to learn the representation and classifier simultaneously. Experimental results on both semi-synthetic and large-scale real datasets corroborate our findings and demonstrate the effectiveness of our objectives.
H. Zhao*, C. Dan*, B. Aragam, T. Jaakkola, G. Gordon, and P. Ravikumar
Journal of Machine Learning Research (JMLR 2022)
(Also presented at NeurIPS 2023 Journal Track)
[abs] [JMLR] [arXiv] [poster]
A wide range of machine learning applications such as privacy-preserving learning, algorithmic fairness, and domain adaptation/generalization among others, involve learning \emph{invariant representations} of the data that aim to achieve two competing goals: (a) maximize information or accuracy with respect to a target response, and (b) maximize invariance or independence with respect to a set of protected features (e.g.\ for fairness, privacy, etc). Despite their wide applicability, theoretical understanding of the optimal tradeoffs --- with respect to accuracy, and invariance --- achievable by invariant representations is still severely lacking. In this paper, we provide an information theoretic analysis of such tradeoffs under both classification and regression settings. More precisely, we provide a geometric characterization of the accuracy and invariance achievable by any representation of the data; we term this feasible region the information plane. We provide an inner bound for this feasible region for the classification case, and an exact characterization for the regression case, which allows us to either bound or exactly characterize the Pareto optimal frontier between accuracy and invariance. Although our contributions are mainly theoretical, a key practical application of our results is in certifying the potential sub-optimality of any given representation learning algorithm for either classification or regression tasks. Our results shed new light on the fundamental interplay between accuracy and invariance, and may be useful in guiding the design of future representation learning algorithms.
J. Chi, W. Shand, Y. Yu, K.-W. Chang, H. Zhao, and Y. Tian
In Proceedings of the 2022 Empirical Methods in Natural Language Processing (EMNLP 2022 Findings)
[abs] [pdf]
Contrastive representation learning has gained much attention due to its superior performance in learning representations from both image and sequential data. However, the learned representations could potentially lead to performance disparities in downstream tasks, such as increased silencing of underrepresented groups in toxicity comment classification. In light of this challenge, in this work, we study learning fair representations that satisfy a notion of fairness known as equalized odds for text classification via contrastive learning. Specifically, we first theoretically analyze the connections between learning representations with a fairness constraint and \emph{conditional supervised contrastive objectives}, and then propose to use conditional supervised contrastive objectives to learn fair representations for text classification. We conduct experiments on two text datasets to demonstrate the effectiveness of our approaches in balancing the trade-offs between task performance and bias mitigation among existing baselines for text classification. Furthermore, we also show that the proposed methods are stable in different hyperparameter settings.
Z. Chen, R. Jiang, B. Duke, H. Zhao, and P. Aarabi
In Proceedings of the 17th European Conference on Computer Vision (ECCV 2022, oral)
[abs] [pdf]
Generative Adversarial Networks (GANs) have been widely applied in modeling diverse image distributions. However, despite its impressive applications, the structure of the latent space in GANs largely remains as a black-box, leaving its controllable generation an open problem, especially when spurious correlations between different semantic attributes exist in the image distributions. To address this problem, previous methods typically learn linear directions or individual channels that control semantic attributes in the image space. However, they often suffer from imperfect disentanglement, or are unable to obtain multi-directional controls. In this work, in light of the above challenges, we propose a novel approach that discovers nonlinear controls, which enables multi-directional manipulation as well as effective disentanglement, based on gradient information in the learned GAN latent space. More specifically, we first learn interpolation directions by following the gradients from classification networks trained separately on the attributes, and then navigate the latent space by exclusively controlling channels activated for the target attribute in the learned directions. Empirically, with small training data, our approach is able to gain fine-grained controls over a diverse set of bi-directional and multi-directional attributes, and we showcase its ability to achieve disentanglement significantly better than state-of-the-art methods both qualitatively and quantitatively.
R. Cheng, G. Balasubramaniam, Y. He, Y. H. Tsai, H. Zhao
In Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence (UAI 2022)
[abs] [pdf]
Multimodal learning considers learning from multi-modality data, aiming to fuse heterogeneous sources of information. However, it is not always feasible to leverage all available modalities due to memory constraints. Further, training on all the modalities may be inefficient when redundant information exists within data, such as different subsets of modalities providing similar performance. In light of these challenges, we study modality selection, intending to efficiently select the most informative and complementary modalities under certain computational constraints. We formulate a theoretical framework for optimizing modality selection in multimodal learning and introduce a utility measure to quantify the benefit of selecting a modality. For this optimization problem, we present efficient algorithms when the utility measure exhibits monotonicity and approximate submodularity. We also establish a novel correspondence between the utility measure and existing marginal contribution scores based on the Shapely value. Last, we demonstrate the efficacy of our algorithm on synthetic (Patch-MNIST) and real-world (PEMS-SF Traffic) datasets.
H. Wang, B. Li, H. Zhao
In Proceedings of the 39th International Conference on Machine Learning (ICML 2022)
[abs] [pdf] [code]
The vast majority of existing algorithms for unsupervised domain adaptation (UDA) focus on adapting from a labeled source domain to an unlabeled target domain directly in a one-off way. Gradual domain adaptation (GDA), on the other hand, assumes a path of ($T\mathrm{-}1$) unlabeled intermediate domains bridging the source and the target, and aims to provide better generalization on the target domain by leveraging the intermediate ones. Under certain assumptions, \citet{kumar2020understanding} proposed a simple algorithm, \textit{gradual self-training}, along with a generalization bound in the order of $e^{\mathcal O(T)}(\eps_0 \mathrm{+}\mathcal O\bigl(\sqrt {\frac{\log T}{n}}\bigr) \bigr)$ for the target domain error, where $\eps_0$ is the source domain error and $n$ is the data size of each domain. Due to the exponential factor, this upper bound becomes vacuous when $T$ is only moderately large. In this work, we analyze gradual self-training under more general and relaxed assumptions, and prove a significantly improved generalization bound as $\eps_0\mathrm{+}\widetilde{\mathcal O}\bigl(T\Delta \mathrm{+} \frac{T}{\sqrt{n}} \mathrm{+} \frac{1}{\sqrt{nT}}\bigr)$, where $\Delta$ is the average distributional distance between consecutive domains. Compared with the existing bound with an \emph{exponential} dependency on $T$ as a \textit{multiplicative} factor, our bound only depends on $T$ \emph{linearly and additively}. Perhaps more interestingly, our result implies the existence of an optimal choice of $T$ that minimizes the generalization error, and it also naturally suggests an optimal way to construct the path of intermediate domains so as to minimize the accumulative path length $T\Delta$ between the source and the target. To corroborate the implications of our theory, we examine gradual self-training on multiple semi-synthetic and real datasets, which confirms our findings. We believe our insights provide a path forward towards the design of future GDA algorithms.
H. Wang, H. Si, B. Li, H. Zhao
In Proceedings of the 39th International Conference on Machine Learning (ICML 2022)
[abs] [pdf] [code]
Domain generalization asks for models trained on a set of training environments to perform well on unseen test environments. Recently, a series of algorithms such as Invariant Risk Minimization (IRM) has been proposed for domain generalization. However, Rosenfeld et al. (2021) shows that in a simple linear data model, even if non-convexity issues are ignored, IRM and its extensions cannot generalize to unseen environments with less than ds+1 training environments, where ds is the dimension of the spurious-feature subspace. In this paper, we propose to achieve domain generalization with Invariant-feature Subspace Recovery (ISR). Our first algorithm, ISR-Mean, can identify the subspace spanned by invariant features from the first-order moments of the class-conditional distributions, and achieve provable domain generalization with ds+1 training environments under the data model of Rosenfeld et al. (2021). Our second algorithm, ISR-Cov, further reduces the required number of training environments to O(1) using the information of second-order moments. Notably, unlike IRM, our algorithms bypass non-convexity issues and enjoy global convergence guarantees. Empirically, our ISRs can obtain superior performance compared with IRM on synthetic benchmarks. In addition, on three real-world image and text datasets, we show that ISR-Mean can be used as a simple yet effective post-processing method to increase the worst-case accuracy of trained models against spurious correlations and group shifts.
H. Shao, Y. Yang, H. Lin, L. Lin, Y. Chen, Q. Yang, H. Zhao
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022)
[abs] [pdf]
The Controllable Variational Autoencoder (ControlVAE) combines automatic control theory with the basic VAE model to manipulate the KL-divergence for overcoming posterior collapse and learning disentangled representations. It has shown success in a variety of applications, such as image generation, disentangled representation learning, and language modeling. However, when it comes to disentangled representation learning, ControlVAE does not delve into the rationale behind it. The goal of this paper is to develop a deeper understanding of ControlVAE in learning disentangled representations, including the choice of a desired KL-divergence (i.e, set point), and its stability during training. We first fundamentally explain its ability to disentangle latent variables from an information bottleneck perspective. We show that KL-divergence is an upper bound of the variational information bottleneck. By controlling the KL-divergence gradually from a small value to a target value, ControlVAE can disentangle the latent factors one by one. Based on this finding, we propose a new DynamicVAE that leverages a modified incremental PI (proportional-integral) controller, a variant of the proportional-integral-derivative (PID) algorithm, and employs a moving average as well as a hybrid annealing method to evolve the value of KL-divergence smoothly in a tightly controlled fashion. In addition, we analytically derive a lower bound of the set point for disentangling. We then theoretically prove the stability of the proposed approach. Evaluation results on multiple benchmark datasets demonstrate that DynamicVAE achieves a good trade-off between the disentanglement and reconstruction quality. We also discover that it can separate disentangled representation learning and reconstruction via manipulating the desired KL-divergence.
R. Xian, H. Ji, H. Zhao
In Proceedings of the 10th International Conference on Learning Representations (ICLR 2022)
[abs] [pdf] [code]
Recent advances in neural modeling have produced deep multilingual language models capable of extracting cross-lingual knowledge from unparallel texts, evidenced by their decent zero-shot transfer performance. While studies have attributed this success to cross-lingually shared representations, quantitative analyses are sparse. Towards a better understanding of the role of multilingual representations, in this work, we first make the following observations through empirical analysis: (1) invariance of the feature representations strongly correlates with transfer performance, and (2) distributional shift in class priors between data in the source and target languages negatively affects performance -- an issue that is largely overlooked in prior work. Based on our findings, we propose an unsupervised cross-lingual learning method, called importance-weighted domain alignment (IWDA), that performs representation alignment, prior shift estimation, and correction. Experiment results demonstrate its superiority under large prior shifts. In addition, our method delivers further performance gains when combined with existing semi-supervised learning techniques.
Y. H. Tsai, T. Li, M. Q. Ma, H. Zhao, K. Zhang, L-P. Morency, R. Salakhutdinov
In Proceedings of the 10th International Conference on Learning Representations (ICLR 2022)
[abs] [pdf] [code]
Conditional contrastive learning frameworks consider the conditional sampling procedure that constructs positive or negative data pairs conditioned on specific variables. Fair contrastive learning constructs negative pairs, for example, from the same gender (conditioning on sensitive information), which in turn reduces undesirable information from the learned representations; weakly supervised contrastive learning constructs positive pairs with similar annotative attributes (conditioning on auxiliary information), which in turn are incorporated into the representations. Although conditional contrastive learning enables many applications, the conditional sampling procedure can be challenging if we cannot obtain sufficient data pairs for some values of the conditioning variable. This paper presents Conditional Contrastive Learning with Kernel (CCL-K) that converts existing conditional contrastive objectives into alternative forms that mitigate the insufficient data problem. Instead of sampling data according to the value of the conditioning variable, CCLK uses the Kernel Conditional Embedding Operator that samples data from all available data and assigns weights to each sampled data given the kernel similarity between the values of the conditioning variable. We conduct experiments using weakly supervised, fair, and hard negatives contrastive learning, showing CCL-K outperforms state-of-the-art baselines.
H. Zhao and G. Gordon
Journal of Machine Learning Research (JMLR 2022)
(Extended version of an earlier paper with the same title appearing in NeurIPS 2019)
[abs] [JMLR] [arXiv]
Real-world applications of machine learning tools in high-stakes domains are often regulated to be fair, in the sense that the predicted target should satisfy some quantitative notion of parity with respect to a protected attribute. However, the exact tradeoff between fairness and accuracy is not entirely clear, even for the basic paradigm of classification problems. In this paper, we characterize an inherent tradeoff between statistical parity and accuracy in the classification setting by providing a lower bound on the sum of group-wise errors of any fair classifiers. Our impossibility theorem could be interpreted as a certain uncertainty principle in fairness: if the base rates differ among groups, then any fair classifier satisfying statistical parity has to incur a large error on at least one of the groups. We further extend this result to give a lower bound on the joint error of any (approximately) fair classifiers, from the perspective of learning fair representations. To show that our lower bound is tight, assuming oracle access to Bayes (potentially unfair) classifiers, we also construct an algorithm that returns a randomized classifier which is both optimal (in terms of accuracy) and fair. Interestingly, when the protected attribute can take more than two values, an extension of this lower bound does not admit an analytic solution. Nevertheless, in this case, we show that the lower bound can be efficiently computed by solving a linear program, which we term as the TV-Barycenter problem, a barycenter problem under the TV-distance. On the upside, we prove that if the group-wise Bayes optimal classifiers are close, then learning fair representations leads to an alternative notion of fairness, known as the accuracy parity, which states that the error rates are close between groups. Finally, we also conduct experiments on real-world datasets to confirm our theoretical findings.
J. Chi, J. Shen, X. Dai, W. Zhang, Y. Tian, and H. Zhao
In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS 2022)
[abs] [pdf]
Algorithmic decisions made by machine learning models in high-stakes domains may have lasting impacts over time. Unfortunately, naive applications of standard fairness criterion in static settings over temporal domains may lead to delayed and adverse effects. To understand the dynamics of performance disparity, we study a fairness problem in Markov decision processes (MDPs). Specifically, we propose return parity, a fairness notion that requires MDPs from different demographic groups that share the same state and action spaces to achieve approximately the same expected time-discounted rewards. We first provide a decomposition theorem for return disparity, which decomposes the return disparity of any two MDPs into the distance between group-wise reward functions, the discrepancy of group policies, and the discrepancy between state visitation distributions induced by the group policies. Motivated by our decomposition theorem, we propose algorithms to mitigate return disparity via learning a shared group policy with state visitation distributional alignment using integral probability metrics. We conduct experiments to corroborate our results, showing that the proposed algorithm can successfully close the disparity gap while maintaining the performance of policies on two real-world recommender system benchmark datasets.
S. Zhou, H. Zhao, S. Zhang, L. Wang, H. Chang, Z. Wang, and W. Zhu
In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS 2022)
[abs] [pdf]
Models trained with offline data often suffer from continual distribution shifts and expensive labeling in changing environments. This calls for a new online learning paradigm where the learner can continually adapt to changing environments with limited labels. In this paper, we propose a new online setting -- Online Active Continual Adaptation, where the learner aims to continually adapt to changing distributions using both unlabeled samples and active queries of limited labels. To this end, we propose Online Self-Adaptive Mirror Descent (OSAMD), which adopts an online teacher-student structure to enable online self-training from unlabeled data, and a margin-based criterion that decides whether to query the labels to track changing distributions. Theoretically, we show that, in the separable case, OSAMD has an $O({T}^{1/2})$ dynamic regret bound under mild assumptions, which is even tighter than the lower bound $\Omega(T^{2/3})$ of traditional online learning with full labels. In the general case, we show a regret bound of $O({\alpha^*}^{1/3} {T}^{2/3} + \alpha^* T)$, where $\alpha^*$ denotes the separability of domains and is usually small. Our theoretical results show that OSAMD can fast adapt to changing environments with active queries. Empirically, we demonstrate that OSAMD achieves favorable regrets under changing environments with limited labels on both simulated and real-world data, which corroborates our theoretical findings.
B. Li, Y. Shen, Y. Wang, W. Zhu, C. J. Reed, J. Zhang, D. Li, K. Keutzer, and H. Zhao
In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI 2022)
[abs] [pdf] [poster]
Invariant risk minimization (IRM) has recently emerged as a promising alternative for domain generalization. Nevertheless, the loss function is difficult to optimize for nonlinear classifiers and the original optimization objective could fail when pseudo-invariant features and geometric skews exist. Inspired by IRM, in this paper we propose a novel formulation for domain generalization, dubbed invariant information bottleneck (IIB). IIB aims at minimizing invariant risks for nonlinear classifiers and simultaneously mitigating the impact of pseudo-invariant features and geometric skews. Specifically, we first present a novel formulation for invariant causal prediction via mutual information. Then we adopt the variational formulation of the mutual information to develop a tractable loss function for nonlinear classifiers. To overcome the failure modes of IRM, we propose to minimize the mutual information between the inputs and the corresponding representations. IIB significantly outperforms IRM on synthetic datasets, where the pseudo-invariant features and geometric skews occur, showing the effectiveness of proposed formulation in overcoming failure modes of IRM. Furthermore, experiments on DomainBed show that IIB outperforms $13$ baselines by $0.9\%$ on average across $7$ real datasets.
G. Zhang, H. Zhao, Y. Yu, and P. Poupart
In Proceedings of the 35th Advances in Neural Information Processing Systems (NeurIPS 2021)
[abs] [pdf] [poster] [code]
Out-of-distribution generalization is one of the key challenges when transferring a model from the lab to the real world. Existing efforts mostly focus on building invariant features among source and target domains. Based on invariant features, a high-performing classifier on source domains could hopefully behave equally well on a target domain. In other words, the invariant features are \emph{transferable}. However, in practice, there are no perfectly transferable features, and some algorithms seem to learn ''more transferable'' features than others. How can we understand and quantify such \emph{transferability}? In this paper, we formally define transferability that one can quantify and compute in domain generalization. We point out the difference and connection with common discrepancy measures between domains, such as total variation and Wasserstein distance. We then prove that our transferability can be estimated with enough samples and give a new upper bound for the target error based on our transferability. Empirically, we evaluate the transferability of the feature embeddings learned by existing algorithms for domain generalization. Surprisingly, we find that many algorithms are not quite learning transferable features, although few could still survive. In light of this, we propose a new algorithm for learning transferable features and test it over various benchmark datasets, including RotatedMNIST, PACS, Office-Home and WILDS-FMoW. Experimental results show that the proposed algorithm achieves consistent improvement over many state-of-the-art algorithms, corroborating our theoretical findings.
Z. Zhang, H. Wang, H. Zhao, H. Tong and H. Ji
In Proceedings of the 2021 Empirical Methods in Natural Language Processing (EMNLP 2021 Findings)
[abs] [pdf] [code]
Knowledge graph (KG) is a kind of efficient and informative representation of structured knowledge. A typical KG consists of a collection of knowledge triples, where each triple $(h, r, t)$ describes that the head entity $h$ and tail entity $t$ are connected through a relation $r$. Recently, extensive studies have been focusing on knowledge graph representation learning, which aims to learn low-dimensional entity and relation embeddings that are informative and scalable to use for many downstream applications, such as information retrieval~\cite{irkg}, recommendation systems~\cite{recommenderkg}, machine reading comprehension~\cite{machinereadingkg}, and query-answering systems~\cite{qakg,qakg2}. Typical KG embedding models, such as \cite{transe,conve,rotate}, usually learn the model parameters by maximizing pre-defined score functions on ground-truth triples. One major limitation of such methods is that each knowledge triple is modeled locally and independently, without considering the global contextual information of KGs. To solve this problem, another line of approaches~\cite{rgcn,compgcn} manages to model KGs as heterogeneous networks, and design message passing among entities using graph neural networks to better utilize global structural information.
J. Chi, Y. Tian, G. Gordon and H. Zhao
In Proceedings of the 38th International Conference on Machine Learning (ICML 2021)
[abs] [pdf] [code]
With the widespread deployment of large-scale prediction systems in high-stakes domains, e.g., face recognition, criminal justice, etc., disparity on prediction accuracy between different demographic subgroups has called for fundamental understanding on the source of such disparity and algorithmic intervention to mitigate it. In this paper, we study the accuracy disparity problem in regression. To begin with, we first propose an error decomposition theorem, which decomposes the accuracy disparity into the distance between marginal label distributions and the distance between conditional representations, to help explain why such accuracy disparity appears in practice. Motivated by this error decomposition and the general idea of distribution alignment with statistical distances, we then propose an algorithm to reduce this disparity, and analyze its game-theoretic optima of the proposed objective functions. To corroborate our theoretical findings, we also conduct experiments on five benchmark datasets. The experimental results suggest that our proposed algorithms can effectively mitigate accuracy disparity while maintaining the predictive power of the regression models.
H. Wang, H. Zhao, B. Li
In Proceedings of the 38th International Conference on Machine Learning (ICML 2021)
[abs] [pdf] [code]
Multi-task learning (MTL) aims to improve the generalization of several related tasks by learning them jointly. As a comparison, in addition to the joint training scheme, modern meta-learning allows unseen tasks with limited labels during the test phase, in the hope of fast adaptation over them. Despite the subtle difference between MTL and meta-learning in the problem formulation, both learning paradigms share the same insight that the shared structure between existing training tasks could lead to better generalization and adaptation. In this paper, we take one important step further to understand the close connection between these two learning paradigms, through both theoretical analysis and empirical investigation. Theoretically, we first demonstrate that MTL shares the same optimization formulation with a class of gradient-based meta-learning (GBML) algorithms. We then prove that for over-parameterized neural networks with sufficient depth, the learned predictive functions of MTL and GBML are close. In particular, this result implies that the predictions given by these two models are similar over the same unseen task. Empirically, we corroborate our theoretical findings by showing that, with proper implementation, MTL is competitive against state-of-the-art GBML algorithms on a set of few-shot image classification benchmarks. Since existing GBML algorithms often involve costly second-order bi-level optimization, our first-order MTL method is an order of magnitude faster on large-scale datasets such as mini-ImageNet. We believe this work could help bridge the gap between these two learning paradigms, and provide a computationally efficient alternative to GBML that also supports fast task adaptation.
P. Liao, H. Zhao, K. Xu, T. Jaakkola, G. Gordon, S. Jegelka and R. Salakhutdinov
In Proceedings of the 38th International Conference on Machine Learning (ICML 2021)
[abs] [pdf] [code]
While the advent of Graph Neural Networks (GNNs) has greatly improved node and graph representation learning in many applications, the neighborhood aggregation scheme exposes additional vulnerabilities to adversaries seeking to extract node-level information about sensitive attributes. In this paper, we study the problem of protecting sensitive attributes by information obfuscation when learning with graph structured data. We propose a framework to locally filter out pre-determined sensitive attributes via adversarial training with the total variation and the Wasserstein distance. Our method creates a strong defense against inference attacks, while only suffering small loss in task performance. Theoretically, we analyze the effectiveness of our framework against a worst-case adversary, and characterize an inherent trade-off between maximizing predictive accuracy and minimizing information leakage. Experiments across multiple datasets from recommender systems, knowledge graphs and quantum chemistry demonstrate that the proposed approach provides a robust defense across various graph structures and tasks, while producing competitive GNN encoders for downstream tasks.
B. Li, Y. Wang, S. Zhang, D. Li, T. Darrell, K. Keutzer and H. Zhao
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021)
[abs] [pdf] [code]
The success of supervised learning hinges on the assumption that the training and test data come from the same underlying distribution, which is often not valid in practice due to potential distribution shift. In light of this, most existing methods for unsupervised domain adaptation focus on achieving domain-invariant representations and small source domain error. However, recent works have shown that this is not sufficient to guarantee good generalization on the target domain, and in fact, is provably detrimental under label distribution shift. Furthermore, in many real-world applications it is often feasible to obtain a small amount of labeled data from the target domain and use them to facilitate model training with source data. Inspired by the above observations, in this paper we propose the first method that aims to simultaneously learn invariant representations and risks under the setting of semi-supervised domain adaptation (Semi-DA). First, we provide a finite sample bound for both classification and regression problems under Semi-DA. The bound suggests a principled way to obtain target generalization, i.e. by aligning both the marginal and conditional distributions across domains in feature space. Motivated by this, we then introduce the LIRR algorithm for jointly \textbf{L}earning \textbf{I}nvariant \textbf{R}epresentations and \textbf{R}isks. Finally, extensive experiments are conducted on both classification and regression tasks, which demonstrates LIRR consistently achieves state-of-the-art performance and significant improvements compared with the methods that only learn invariant representations or invariant risks.
Y. H. Tsai*, M. Q. Ma*, M. Yang, H. Zhao, L-P Morency, and R. Salakhutdinov
In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021)
[abs] [pdf] [video]
This paper introduces Relative Predictive Coding (RPC), a new contrastive representation learning objective that maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance. The key to the success of RPC is two-fold. First, RPC introduces the relative parameters to regularize the objective for boundedness and low variance. Second, RPC contains no logarithm and exponential score functions, which are the main cause of training instability in prior contrastive objectives. We empirically verify the effectiveness of RPC on benchmark vision and speech self-supervised learning tasks. Lastly, we relate RPC with mutual information (MI) estimation, showing RPC can be used to estimate MI with low variance.
P. Li, Y. Wang, H. Zhao, P. Hong, H. Liu
In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021)
[abs] [pdf] [video]
Disparate impact has raised serious concerns in machine learning applications and its societal impacts. In response to the need of mitigating discrimination, fairness has been regarded as a crucial property in algorithmic design. In this work, we study the problem of disparate impact on graph-structured data. Specifically, we focus on dyadic fairness, which articulates a fairness concept that a predictive relationship between two instances should be independent of the sensitive attributes. Based on this, we theoretically relate the graph connections to dyadic fairness on link predictive scores in learning graph neural networks, and reveal that regulating weights on existing edges in a graph contributes to dyadic fairness conditionally. Subsequently, we propose our algorithm, \textbf{FairAdj}, to empirically learn a fair adjacency matrix with proper graph structural constraints for fair link prediction, and in the meanwhile preserve predictive accuracy as much as possible. Empirical validation demonstrates that our method delivers effective dyadic fairness in terms of various statistics, and at the same time enjoys a favorable fairness-utility tradeoff.
J. Shen, H. Zhao, W. Zhang and Y. Yu
In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS 2020, Spotlight)
[abs] [pdf] [code] [video]
Model-based reinforcement learning methods learn a dynamics model with real data sampled from the environment and leverage it to generate simulated data to derive an agent. However, due to the potential distribution mismatch between simulated data and real data, this could lead to degraded performance. Despite much effort being devoted to reducing this distribution mismatch, existing methods fail to solve it explicitly. In this paper, we investigate how to bridge the gap between real and simulated data due to inaccurate model estimation for better policy optimization. To begin with, we first derive a lower bound of the expected return, which naturally inspires a bound maximization algorithm by aligning the simulated and real data distributions. To this end, we propose a novel model-based reinforcement learning framework AMPO, which introduces unsupervised model adaptation to minimize the integral probability metric (IPM) between feature distributions from real and simulated data. Instantiating our framework with Wasserstein-1 distance gives a practical model-based approach. Empirically, our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks.
Y. H. Tsai, H. Zhao, M. Yamada, L-P. Morency and R. Salakhutdinov
In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS 2020, Spotlight)
[abs] [pdf] [video]
Since its inception, the neural estimation of mutual information (MI) has demonstrated the empirical success of modeling expected dependency between high-dimensional random variables. However, MI is an aggregate statistic and cannot be used to measure point-wise dependency between different events. In this work, instead of estimating the expected dependency, we focus on estimating point-wise dependency (PD), which quantitatively measures how likely two outcomes co-occur. We show that we can naturally obtain PD when we are optimizing MI neural variational bounds. However, optimizing these bounds is challenging due to its large variance in practice. To address this issue, we develop two methods (free of optimizing MI variational bounds): Probabilistic Classifier and Density-Ratio Fitting. We demonstrate the effectiveness of our approaches in 1) MI estimation, 2) self-supervised representation learning, and 3) cross-modal retrieval task.
R. Combes*, H. Zhao*, Y.X. Wang and G. Gordon
In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS 2020)
[abs] [pdf] [code] [video]
Adversarial learning has demonstrated good performance in the unsupervised domain adaptation setting, by learning domain-invariant representations. However, recent work has shown limitations of this approach when label distributions differ between the source and target domains. In this paper, we propose a new assumption, \textit{generalized label shift} ($\glsa$), to improve robustness against mismatched label distributions. $\glsa$ states that, conditioned on the label, there exists a representation of the input that is invariant between the source and target domains. Under $\glsa$, we provide theoretical guarantees on the transfer performance of any classifier. We also devise necessary and sufficient conditions for $\glsa$ to hold, by using an estimation of the relative class weights between domains and an appropriate reweighting of samples. Our weight estimation method could be straightforwardly and generically applied in existing domain adaptation (DA) algorithms that learn domain-invariant representations, with small computational overhead. In particular, we modify three DA algorithms, JAN, DANN and CDAN and evaluate their performance on standard and artificial DA tasks. Our algorithms outperform the base versions, with vast improvements for large label distribution mismatches. Our code is available at \url{https://tinyurl.com/y585xt6j}.
H. Zhao*, J. Chi*, Y. Tian and G. Gordon
In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS 2020)
[abs] [pdf] [video] [slides] [poster]
Crowdsourced data used in machine learning services might carry sensitive information about attributes that users do not want to share. Various methods have been proposed to minimize the potential information leakage of sensitive attributes while maximizing the task accuracy. However, little is known about the theory behind these methods. In light of this gap, we develop a novel theoretical framework for attribute obfuscation. Under our framework, we propose a minimax optimization formulation to protect the given attribute and analyze its inference guarantees against worst-case adversaries. Meanwhile, there is a tension between minimizing information leakage and maximizing task accuracy. To understand this, we prove an information-theoretic lower bound to precisely characterize the fundamental trade-off between accuracy and information leakage. We conduct experiments on two real-world datasets to corroborate the inference guarantees and validate the inherent trade-offs therein. Our results indicate that, among several alternatives, the adversarial learning approach achieves the best trade-off in terms of attribute obfuscation and accuracy maximization.
S. Zhao, X. Yue, S. Zhang, B. Li, H. Zhao, B. Wu, R. Krishna, J. Gonzalez, A. Sangiovanni-Vincentelli, S. and K. Keutzer
IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
[abs] [pdf]
Large-scale labeled training datasets have enabled deep neural networks to excel across a wide range of benchmark vision tasks. However, in many applications, it is prohibitively expensive and time-consuming to obtain large quantities of labeled data. To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled or unlabeled target domain. Unfortunately, direct transfer across domains often performs poorly due to the presence of domain shift or dataset bias. Domain adaptation is a machine learning paradigm that aims to learn a model from a source domain that can perform well on a different (but related) target domain. In this paper, we review the latest single-source deep unsupervised domain adaptation methods focused on visual tasks and discuss new perspectives for future research. We begin with the definitions of different domain adaptation strategies and the descriptions of existing benchmark datasets. We then summarize and compare different categories of single-source unsupervised domain adaptation methods, including discrepancy-based methods, adversarial discriminative methods, adversarial generative methods, and self-supervision-based methods. Finally, we discuss future research directions with challenges and possible solutions.
H. Zhao, J. Hu and A. Risteski
In Proceedings of the 37th International Conference on Machine Learning (ICML 2020)
[abs] [pdf] [video] [slides] [blog]
The goal of universal machine translation is to learn to translate between any pair of languages, given a corpus of paired translated documents for a small subset of all pairs of languages. Despite impressive empirical results and an increasing interest in massively multilingual models, theoretical analysis on translation errors made by such universal machine translation models is only nascent. In this paper, we formally prove certain impossibilities of this endeavour in general, as well as prove positive results in the presence of additional (but natural) structure of data. For the former, we derive a lower bound on the translation error in the many-to-one translation setting, which shows that any algorithm aiming to learn shared sentence representations among multiple language pairs has to make a large translation error on at least one of the translation tasks, if no assumption on the structure of the languages is made. For the latter, we show that if the paired documents in the corpus follow a natural encoder-decoder generative process, we can expect a natural notion of ``generalization'': a linear number of language pairs, rather than quadratic, suffices to learn a good representation. Our theory also explains what kinds of connection graphs between pairs of languages are better suited: ones with longer paths result in worse sample complexity in terms of the total number of documents per language pair needed. We believe our theoretical insights and implications contribute to the future algorithmic design of universal machine translation.
P. Li, H. Zhao and H. Liu
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020)
[abs] [pdf]
Fair clustering aims to hide sensitive attributes during data partition by balancing the distribution of protected subgroups in each cluster. Existing work attempts to address this problem by reducing it to a classical balanced clustering with a constraint on the proportion of protected subgroups of the input space. However, the input space may limit the clustering performance, and so far only low-dimensional datasets have been considered. In light of these limitations, in this paper, we propose Deep Fair Clustering (DFC) to learn fair and clustering-favorable representations for clustering simultaneously. Our approach could effectively filter out sensitive attributes from representations, and also lead to representations that are amenable for the following cluster analysis. Theoretically, we show that our fairness constraint in DFC will not incur much loss in terms of several clustering metrics. Empirically, we provide extensive experimental demonstrations on four visual datasets to corroborate the superior performance of the proposed approach over existing fair clustering and deep clustering methods on both cluster validity and fairness criterion.
W. Wang, H. Zhao, H. Zhuang, N. Shah and R. Padman
In Proceedings of The Web Conference 2020 (WWW 2020, Oral)
[abs] [pdf]
Early identification of patients at risk for postoperative complications can facilitate timely workups and treatments and improve health outcomes. Currently, a widely-used surgical risk calculator online web system developed by the American College of Surgeons (ACS) uses patients’ static features, e.g. gender, age, to assess the risk of postoperative complications. However, the most crucial signals that reflect the actual postoperative physical conditions of patients are usually real-time dynamic signals, including the vital signs of patients (e.g., heart rate, blood pressure) collected from postoperative monitoring. In this paper, we develop a dynamic postoperative complication risk scoring framework (DyCRS) to detect the “at-risk” patients in a real-time way based on postoperative sequential vital signs and static features. DyCRS is based on adaptations of the Hidden Markov Model (HMM) that captures hidden states as well as observable states to generate a real-time, probabilistic, complication risk score. Evaluating our model using electronic health record (EHR) on elective Colectomy surgery from a major health system, we show that DyCRS significantly outperforms the state-of-the-art ACS calculator and real-time predictors with 50.16% area under precision-recall curve (AUCPRC) gain on average in terms of detection effectiveness. In terms of earliness, our DyCRS can predict 15hrs55mins earlier on average than clinician's diagnosis with the recall of 60% and precision of 55%. Furthermore, Our DyCRS can extract interpretable patients' stages, which are consistent with previous medical postoperative complication studies. We believe that our contributions demonstrate significant promise for developing a more accurate, robust and interpretable postoperative complication risk scoring system, which can benefit more than 50 million annual surgeries in the US by substantially lowering adverse events and healthcare costs.
H. Zhao, A. Coston, T. Adel and G. Gordon
In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020, Spotlight)
NeurIPS 2019 Workshop on Machine Learning with Guarantees (NeurIPS 2019)
[abs] [pdf] [video] [slides] [code]
We propose a novel algorithm for learning fair representations that can simultaneously mitigate two notions of disparity among different demographic subgroups. Two key components underpinning the design of our algorithm are balanced error rate and conditional alignment of representations. We show how these two components contribute to ensuring accuracy parity and equalized false-positive and false-negative rates across groups without impacting demographic parity. Furthermore, we also demonstrate both in theory and on two real-world experiments that the proposed algorithm leads to a better utility-fairness trade-off on balanced datasets compared with existing algorithms on learning fair representations.
T. Adel, H. Zhao and R. E. Turner
In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020)
[abs] [pdf] [video]
Approaches to continual learning aim to successfully learn a set of related tasks that arrive in an online manner. Recently, several frameworks have been developed which enable deep learning to be deployed in this learning scenario. A key modelling decision is to what extent the architecture should be shared across tasks. On the one hand, separately modelling each task avoids catastrophic forgetting but it does not support transfer learning and leads to large models. On the other hand, rigidly specifying a shared component and a task-specific part enables task transfer and limits the model size, but it is vulnerable to catastrophic forgetting and restricts the form of task-transfer that can occur. Ideally, the network should adaptively identify which parts of the network to share in a data driven way. Here we introduce such an approach called Continual Learning with Adaptive Weights (CLAW), which is based on probabilistic modelling and variational inference. Experiments show that CLAW achieves state-of-the-art performance on six benchmarks in terms of overall continual learning performance, as measured by classification accuracy, and in terms of addressing catastrophic forgetting.
H. Zhao and G. Gordon
In Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS 2019)
[abs] [pdf] [poster] [slides] [blog]
With the prevalence of machine learning in high-stakes applications, especially the ones regulated by anti-discrimination laws or societal norms, it is crucial to ensure that the predictive models do not propagate any existing bias or discrimination. Due to the ability of deep neural nets to learn rich representations, recent advances in algorithmic fairness have focused on learning fair representations with adversarial techniques to reduce bias in data while preserving utility simultaneously. In this paper, through the lens of information theory, we provide the first result that quantitatively characterizes the tradeoff between demographic parity and the joint utility across different population groups. Specifically, when the base rates differ between groups, we show that any method aiming to learn fair representations admits an information-theoretic lower bound on the joint error across these groups. To complement our negative results, we also prove that if the optimal decision functions across different groups are close, then learning fair representations leads to an alternative notion of fairness, known as the accuracy parity, which states that the error rates are close between groups. Finally, our theoretical findings are also confirmed empirically on real-world datasets.
H. Zhao*, J. Chi*, Y. Tian and G. Gordon
In NeurIPS 2019 Workshop on Machine Learning with Guarantees (NeurIPS 2019)
[abs] [pdf]
With the prevalence of machine learning services, crowdsourced data containing sensitive information poses substantial privacy challenges. Existing work focusing on protecting against membership inference attacks under the rigorous framework of differential privacy are vulnerable to attribute inference attacks. In light of the current gap between theory and practice, we develop a novel theoretical framework for privacy-preservation under the attack of attribute inference. Under our framework, we propose a minimax optimization formulation to protect the given attribute and analyze its privacy guarantees against arbitrary adversaries. On the other hand, it is clear that privacy constraint may cripple utility when the protected attribute is correlated with the target variable. To this end, we also prove an information-theoretic lower bound to precisely characterize the fundamental trade-off between utility and privacy. Empirically, we extensively conduct experiments to corroborate our privacy guarantee and validate the inherent trade-offs in different privacy preservation algorithms. Our experimental results indicate that the adversarial representation learning approaches achieve the best trade-off in terms of privacy preservation and utility maximization.
H. Zhao, R. Combes, K. Zhang and G. Gordon
In Proceedings of the 36th International Conference on Machine Learning (ICML 2019, Long Oral)
[abs] [pdf] [supplement] [poster] [slides] [blog]
Due to the ability of deep neural nets to learn rich representations, recent advances in unsupervised domain adaptation have focused on learning domain-invariant features that achieve a small error on the source domain. The hope is that the learnt representation, together with the hypothesis learnt from the source domain, can generalize to the target domain. In this paper, we first construct a simple counterexample showing that, contrary to common belief, the above conditions are not sufficient to guarantee successful domain adaptation. In particular, the counterexample (Fig. 1) exhibits \emph{conditional shift}: the class-conditional distributions of input features change between source and target domains. To give a sufficient condition for domain adaptation, we propose a natural and interpretable generalization upper bound that explicitly takes into account the aforementioned shift. Moreover, we shed new light on the problem by proving an information-theoretic lower bound on the joint error of \emph{any} domain adaptation method that attempts to learn invariant representations. Our result characterizes a fundamental tradeoff between learning invariant representations and achieving small joint error on both domains when the marginal label distributions differ from source to target. Finally, we conduct experiments on real-world datasets that corroborate our theoretical findings. We believe these insights are helpful in guiding the future design of domain adaptation and representation learning algorithms.
H. Zhao*, Y. H. Tsai*, R. Salakhutdinov and G. Gordon
In Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS 2019)
[abs] [pdf] [poster] [slides] [code]
Feed-forward neural networks can be understood as a combination of an intermediate representation and a linear hypothesis. While most previous works aim to diversify the representations, we explore the complementary direction by performing an adaptive and data-dependent regularization motivated by the empirical Bayes method. Specifically, we propose to construct a matrix-variate normal prior (on weights) whose covariance matrix has a Kronecker product structure. This structure is designed to capture the correlations in neurons through backpropagation. Under the assumption of this Kronecker factorization, the prior encourages neurons to borrow statistical strength from one another. Hence, it leads to an adaptive and data-dependent regularization when training networks on small datasets. To optimize the model, we present an efficient block coordinate descent algorithm with analytical solutions. Empirically, we demonstrate that the proposed method helps networks converge to local optima with smaller stable ranks and spectral norms. These properties suggest better generalizations and we present empirical results to support this expectation. We also verify the effectiveness of the approach on multiclass classification and multitask regression problems with various network structures.
H. Zhao, O. Stretcu, A. Smola and G. Gordon
In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI 2019)
Also In Learning with Limited Labeled Data: Weak Supervision and Beyond workshop at NIPS 2017
[abs] [pdf] [supplement] [poster]
We consider a multitask learning problem, in which several predictors are learned jointly. Prior research has shown that learning the relations between tasks, and between the input features, together with the predictor, can lead to better generalization and interpretability, which proved to be useful for applications in many domains. In this paper, we consider a formulation of multitask learning that learns the relationships both between tasks and between features, represented through a task covariance and a feature covariance matrix, respectively. First, we demonstrate that existing methods proposed for this problem present an issue that may lead to ill-posed optimization. We then propose an alternative formulation, as well as an efficient algorithm to optimize it. Using ideas from optimization and graph theory, we propose an efficient coordinate-wise minimization algorithm that has a closed form solution for each block subproblem. Our experiments show that the proposed optimization method is orders of magnitude faster than its competitors. We also provide a nonlinear extension that is able to achieve better generalization than existing methods.
Y. Xu*, H. Zhao*, X. Shi and N. B. Shah
In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019)
[abs] [pdf] [supplement] [Full arXiv version]
We consider peer review under a conference setting where there are conflicts between the reviewers and the submissions. Under such conflicts, reviewers can manipulate their reviews in a strategic manner to influence the final rankings of their own papers. Present-day peer-review systems are not designed to guard against such strategic behavior, beyond minimal (and insufficient) checks such as not assigning a paper to a conflicted reviewer. In this work, we address this problem through the lens of social choice, and present a theoretical framework for strategyproof and efficient peer review. Given the conflict graph which satisfies a simple property, we first present and analyze a flexible framework for reviewer-assignment and aggregation for the reviews that guarantees not only strategyproofness but also a natural efficiency property (unanimity). Our framework is based on the so-called partitioning method, and can be treated as a generalization of this type of method to conference peer review settings. We then empirically show that the requisite property on the (authorship) conflict graph is indeed satisfied in the ICLR-17 submissions data, and further demonstrate a simple trick to make the partitioning method more practically appealing under conference peer-review settings. Finally, we complement our positive results with negative theoretical results where we prove that under slightly stronger requirements, it is impossible for any algorithm to be both strategyproof and efficient.
C. Liang, J. Ye, H. Zhao, B. Pursel and C. Lee Giles
In Proceedings of the 12th International Conference on Educational Data Mining (EDM 2019)
[abs] [pdf]
Strict partial order is a mathematical structure commonly seen in relational data. One obstacle to extracting such type of relations at scale is the lack of large scale labels for building effective data-driven solutions. We develop an active learning framework for mining such relations subject to a strict order. Our approach incorporates relational reasoning not only in finding new unlabeled pairs whose labels can be deduced from an existing label set, but also in devising new query strategies that consider the relational structure of labels. Our experiments on concept prerequisite relations show our proposed framework can substantially improve the classification performance with the same query budget compared to other baseline approaches.
H. Zhao, J. Hu, Z. Zhu, A. Coates and G. Gordon
In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019)
[abs] [pdf] [poster]
The ability to adapt to and learn from different domains and environments is crucial for agents to generalize. In this paper we propose a probabilistic framework for domain adaptation that blends both generative and discriminative modeling in a principled way. Under this framework, generative and discriminative models correspond to specific choices of the prior over parameters. By maximizing both the marginal and the conditional log-likelihoods, our models can use both labeled instances from the source domain as well as unlabeled instances from \emph{both} source and target domains. We show that the popular reconstruction loss of autoencoder corresponds to an upper bound of the negative marginal log-likelihoods of unlabeled instances, and give a generalization bound that explicitly incorporates it into the analysis. We instantiate our framework using neural networks, and build a concrete model, DAuto.
H. Zhao*, S. Zhang*, G. Wu, J. Costeira, J. Moura and G. Gordon
In Proceedings of the 32nd Advances in Neural Information Processing Systems (NeurIPS 2018)
[abs] [pdf] [supplement] [poster] [code]
While domain adaptation has been actively researched, most algorithms focus on the single-source-single-target adaptation setting. In this paper we propose new generalization bounds and algorithms under both classification and regression settings for unsupervised multiple source domain adaptation. Our theoretical analysis naturally leads to an efficient learning strategy using adversarial neural networks: we show how to interpret it as learning feature representations that are invariant to the multiple domain shifts while still being discriminative for the learning task. To this end, we propose multisource domain adversarial networks (MDAN) that approach domain adaptation by optimizing task-adaptive generalization bounds. To demonstrate the effectiveness of MDAN, we conduct extensive experiments showing superior adaptation performance on both classification and regression problems: sentiment analysis, digit classification, and vehicle counting.
H. Zhao and G. Gordon
In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018)
[abs] [pdf] [supplement]
Symmetric nonnegative matrix factorization has found abundant applications in various domains by providing a symmetric low-rank decomposition of nonnegative matrices. In this paper we propose a Frank-Wolfe (FW) solver to optimize the symmetric nonnegative matrix factorization problem under a simplicial constraint, which has recently been proposed for probabilistic clustering. Compared with existing solutions, this algorithm is simple to implement, and has no hyperparameters to be tuned. Building on the recent advances of FW algorithms in nonconvex optimization, we prove an $O(1/\eps^2)$ convergence rate to $\eps$-approximate KKT points, via a tight bound $\Theta(n^2)$ on the curvature constant, which matches the best known result in unconstrained nonconvex setting using gradient methods. Numerical results demonstrate the effectiveness of our algorithm. As a side contribution, we construct a simple nonsmooth convex problem where the FW algorithm fails to converge to the optimum. This result raises an interesting question about necessary conditions of the success of the FW algorithm on convex problems.
H. Zhao, S. Zarar, I. Tashev and C. H. Lee
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018, Oral)
[abs] [pdf] [slides]
We propose an end-to-end model based on convolutional and recurrent neural networks for speech enhancement. Our model is purely data-driven and does not make any assumptions about the type or the stationarity of the noise. In contrast to existing methods that use multilayer perceptrons (MLPs), we employ both convolutional and recurrent neural network architectures. Thus, our approach allows us to exploit local structures in both the frequency and temporal domains. By incorporating prior knowledge of speech signals into the design of model structures, we build a model that is more data-efficient and achieves better generalization on both seen and unseen noise. Based on experiments with synthetic data, we demonstrate that our model outperforms existing methods, improving PESQ by up to 0.6 on seen noise and 0.64 on unseen noise.
H. Zhao*, Y. H. Tsai*, R. Salakhutdinov and G. Gordon
In Uncertainty in Deep Learning workshop at UAI (UAI UDL 2018)
[abs] [pdf] [poster]
We propose an approximate empirical Bayes framework and an efficient algorithm for learning the weight matrix of deep neural networks. Empirically, we show the proposed method works as a regularization approach that helps generalization when training neural networks on small datasets.
H. Zhao*, S. Zhang*, G. Wu, J. Costeira, J. Moura and G. Gordon
In 6th International Conference on Learning Representations (ICLR 2018 workshop track)
[abs] [pdf] [poster]
While domain adaptation has been actively researched in recent years, most theoretical results and algorithms focus on the single-source-single-target adaptation setting. Naive application of such algorithms on multiple source domain adaptation problem may lead to suboptimal solutions. We propose a new generalization bound for domain adaptation when there are multiple source domains with labeled instances and one target domain with unlabeled instances. Compared with existing bounds, the new bound does not require expert knowledge about the target distribution, nor the optimal combination rule for multisource domains. Interestingly, our theory also leads to an efficient learning strategy using adversarial neural networks: we show how to interpret it as learning feature representations that are invariant to the multiple domain shifts while still being discriminative for the learning task. To this end, we propose two models, both of which we call multisource domain adversarial networks (MDANs): the first model optimizes directly our bound, while the second model is a smoothed approximation of the first one, leading to a more data-efficient and task-adaptive model. The optimization tasks of both models are minimax saddle point problems that can be optimized by adversarial training. To demonstrate the effectiveness of MDANs, we conduct extensive experiments showing superior adaptation performance on three real-world datasets: sentiment analysis, digit classification, and vehicle counting.
H. Zhao and G. Gordon
In Proceedings of the 31st Advances in Neural Information Processing Systems (NIPS 2017)
[abs] [pdf] [poster]
Bayesian online algorithms for Sum-Product Networks (SPNs) need to update their posterior distribution after seeing one single additional instance. To do so, they must compute moments of the model parameters under this distribution. The best existing method for computing such moments scales quadratically in the size of the SPN, although it scales linearly for trees. This unfortunate scaling makes Bayesian online algorithms prohibitively expensive, except for small or tree-structured SPNs. We propose a linear-time algorithm that works even when the SPN is a general directed acyclic graph (DAG). Our algorithm significantly broadens the applicability of Bayesian online algorithms for SPNs. There are three key ingredients in the design and analysis of our algorithm: 1). For each edge in the graph, we find a linear time reduction from the moment computation problem to a joint inference problem in SPNs. 2). Using the property that each SPN computes a multilinear polynomial, we construct an efficient procedure for polynomial evaluation by differentiation without expanding the network that may contain exponentially many positive monomials. 3). We propose a dynamic programming method to further reduce the computation of the moments of all the edges in the graph from quadratic to linear. We demonstrate the usefulness of our linear time moment computation algorithm by applying it to develop a linear time assume density filter (ADF) for SPNs.
T. Adel, H. Zhao and A. Wong
In Proceedings of the 31th AAAI Conference on Artificial Intelligence (AAAI 2017)
[abs] [pdf]
Domain adaptation addresses learning tasks where training is performed on data from one domain whereas testing is performed on data belonging to a different but related domain. Assumptions about the relationship between the source and target domains should lead to tractable solutions on the one hand, and be realistic on the other hand. Here we propose a generative domain adaptation model that allows for modeling different assumptions about this relationship, among which is a newly introduced assumption that replaces covariate shift with a possibly more realistic assumption without losing tractability due to the efficient variational inference procedure developed. In addition to the ability to model less restrictive relationships between source and target, modeling can be performed without any target labeled data (unsupervised domain adaptation). We also provide a Rademacher complexity bound of the proposed algorithm. We evaluate the model on the Amazon reviews and the CVC pedestrian detection datasets.
Y. H. Tsai, H. Zhao, R. Salakhutdinov and N. Jojic
In Time Series workshop at NIPS (NIPS TSW 2017)
[abs] [pdf] [slides] [poster]
The assumption that data samples are independently identically distributed is the backbone of many learning algorithms. Nevertheless, datasets often exhibit rich structures in practice, and we argue that there exist some unknown orders within the data instances. Aiming to find such orders, we introduce a novel Generative Markov Network (GMN) which we use to extract the order of data instances automatically. Specifically, we assume that the instances are sampled from a Markov chain. Our goal is to learn the transitional operator of the chain as well as the generation order by maximizing the generation probability under all possible data permutations. One of our key ideas is to use neural networks as a soft lookup table for approximating the possibly huge, but discrete transition matrix. This strategy allows us to amortize the space complexity with a single model and make the transitional operator generalizable to unseen instances. To ensure the learned Markov chain is ergodic, we propose a greedy batch-wise permutation scheme that allows fast training. Empirically, we evaluate the learned Markov chain by showing that GMNs are able to discover orders among data instances and also perform comparably well to state-of-the-art methods on the one-shot recognition benchmark task.
H. Zhao, P. Poupart and G. Gordon
In Proceedings of the 30th Advances in Neural Information Processing Systems (NIPS 2016)
[abs] [pdf] [supplement] [poster] [code]
We present a unified approach for learning the parameters of Sum-Product networks (SPNs). We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we characterize the objective function when learning SPNs based on the maximum likelihood estimation (MLE) principle and show that the optimization problem can be formulated as a signomial program. Both the projected gradient descent (PGD) and the exponentiated gradient (EG) in this setting can be viewed as first order approximations of the signomial program after proper transformation of the objective function. Based on the signomial program formulation, we construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. The two proposed methods naturally admit multiplicative updates, hence effectively avoiding the projection operation. With the help of the unified framework, we also show that, in the case of SPNs, CCCP leads to the same algorithm as Expectation Maximization (EM) despite the fact that they are different in general. Extensive experiments on 20 data sets demonstrate the effectiveness and efficiency of the two proposed approaches for learning SPNs. We also show that the proposed methods can improve the performance of structure learning and yield state-of-the-art results.
H. Zhao, T. Adel, G. Gordon and B. Amos
In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016)
[abs] [pdf] [poster] [slides] [code]
Sum-Product Networks (SPNs) are probabilistic inference machines that admit exact inference in linear time in the size of the network. Existing parameter learning approaches for SPNs are largely based on the maximum likelihood principle and hence are subject to overfitting compared to more Bayesian approaches. Exact Bayesian posterior inference for SPNs is computationally intractable. Both standard variational inference and posterior sampling for SPNs are computationally infeasible even for networks of moderate size due to the large number of local latent variables per instance. In this work, we propose a novel deterministic collapsed variational inference algorithm for SPNs that is computationally efficient, easy to implement and at the same time allows us to incorporate prior information into the optimization formulation. Extensive experiments show a significant improvement in accuracy compared with a maximum likelihood based approach.
P. Jaini, A. Rashwan, H. Zhao, Y. Liu, E. Banijamali, Z. Chen and P. Poupart
In Proceedings of the 8th International Conference on Probabilistic Graphical Models (PGM 2016)
[abs] [pdf]
Sum-product networks (SPNs) have recently emerged as an attractive representation due to their dual interpretation as a special type of deep neural network with clear semantics and a tractable probabilistic graphical model. We explore online algorithms for parameter learning in SPNs with continuous variables. More specifically, we consider SPNs with Gaussian leaf distributions and show how to derive an online Bayesian moment matching algorithm to learn from streaming data. We compare the resulting generative models to stacked restricted Boltzmann machines and generative moment matching networks on real-world datasets.
A. Rashwan, H. Zhao and P. Poupart
In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016)
[abs] [pdf]
Probabilistic graphical models provide a general and flexible framework for reasoning about complex dependencies in noisy domains with many variables. Among the various types of probabilistic graphical models, sum-product networks (SPNs) have recently generated some interest because exact inference can always be done in linear time with respect to the size of the network. This is particularly attractive since it means that learning an SPN from data always yields a tractable model for inference. However, existing parameter learning algorithms for SPNs operate in batch mode and do not scale easily to large datasets. In this work, we explore online algorithms to ensure that parameter learning can also be done tractably with respect to the amount of data. More specifically, we propose a new Bayesian moment matching (BMM) algorithm that operates naturally in an online fashion and that can be easily distributed. We demonstrate the effectiveness and scalability of BMM in comparison to online extensions of gradient descent and expectation maximization on 20 classic benchmarks and 4 large scale datasets.
H. Zhao, M. Melibari and P. Poupart
In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015)
[abs] [pdf] [supplement] [Full arXiv version] [slides] [poster]
In this paper, we establish some theoretical connections between Sum-Product Networks (SPNs) and Bayesian Networks (BNs). We prove that every SPN can be converted into a BN in linear time and space in terms of the network size. The key insight is to use Algebraic Decision Diagrams (ADDs) to compactly represent the local conditional probability distributions at each node in the resulting BN by exploiting context-specific independence (CSI). The generated BN has a simple directed bipartite graphical structure. We show that by applying the Variable Elimination algorithm (VE) to the generated BN with ADD representations, we can recover the original SPN where the SPN can be viewed as a history record or caching of the VE inference process. To help state the proof clearly, we introduce the notion of {\em normal} SPN and present a theoretical analysis of the consistency and decomposability properties. We conclude the paper with some discussion of the implications of the proof and establish a connection between the depth of an SPN and a lower bound of the tree-width of its corresponding BN.
H. Zhao, Z. Lu and P. Poupart
In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015)
[abs] [pdf] [slides] [poster] [code]
The ability to accurately model a sentence at varying stages (e.g., word-phrase-sentence) plays a central role in natural language processing. As an effort towards this goal we propose a self-adaptive hierarchical sentence model (AdaSent). AdaSent effectively forms a hierarchy of representations from words to phrases and then to sentences through recursive gated local composition of adjacent segments. We design a competitive mechanism (through gating networks) to allow the representations of the same sentence to be engaged in a particular learning task (e.g., classification), therefore effectively mitigating the gradient vanishing problem persistent in other recursive models. Both qualitative and quantitative analysis shows that AdaSent can automatically form and select the representations suitable for the task at hand during training, yielding superior classification performance over competitor models on 5 benchmark data sets.
H. Zhao, P. Pouart, Y. Zhang and M. Lysy
In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI 2015)
[abs] [pdf] [poster]
We propose SoF (Soft-cluster matrix Factorization), a probabilistic clustering algorithm which softly assigns each data point into clusters. Unlike model-based clustering algorithms, SoF does not make assumptions about the data density distribution. Instead, we take an axiomatic approach to define 4 properties that the probability of co-clustered pairs of points should satisfy. Based on the properties, SoF utilizes a distance measure between pairs of points to induce the conditional co-cluster probabilities. The objective function in our framework establishes an important connection between probabilistic clustering and constrained symmetric Nonnegative Matrix Factorization (NMF), hence providing a theoretical interpretation for NMF-based clustering algorithms. To optimize the objective, we derive a sequential minimization algorithm using a penalty method. Experimental results on both synthetic and real-world datasets show that SoF significantly outperforms previous NMF-based algorithms and that it is able to detect non-convex patterns as well as cluster boundaries.
F. Faisal, H. Zhao and T. Milenkovic
IEEE/ACM Transactions on Computational Biology and Bioinformatics (IEEE/ACM TCBB 2014)
In Proceedings of the 4th ACM International Conference on Bioinformatics, Computational Biology and Biomedicine (ACM BCB 2013)
[abs] [pdf] [supplement]
Analogous to sequence alignment, network alignment (NA) can be used to transfer biological knowledge across species between conserved network regions. NA faces two algorithmic challenges: 1) Which cost function to use to capture “similarities” between nodes in different networks? 2) Which alignment strategy to use to rapidly identify “high-scoring” alignments from all possible alignments? We “break down” existing state-of-the-art methods that use both different cost functions and different alignment strategies to evaluate each combination of their cost functions and alignment strategies. We find that a combination of the cost function of one method and the alignment strategy of another method beats the existing methods. Hence, we propose this combination as a novel superior NA method. Then, since human aging is hard to study experimentally due to long lifespan, we use NA to transfer aging-related knowledge from well annotated model species to poorly annotated human. By doing so, we produce novel human aging-related knowledge, which complements currently available knowledge about aging that has been obtained mainly by sequence alignment. We demonstrate significant similarity between topological and functional properties of our novel predictions and those of known aging-related genes. We are the first to use NA to learn more about aging.
H. Zhao and P. Poupart
In Method of Moments and Spectral Learning workshop at ICML (ICML MM 2014)
[abs] [pdf] [slides] [poster] [code]
Spectral learning recently generated lots of excitement in machine learning, largely because it is the first known method to produce consistent estimates (under suitable conditions) for several latent variable models. In contrast, maximum likelihood estimates may get trapped in local optima due to the non-convex nature of the likelihood function of latent variable models. In this paper, we do an empirical evaluation of spectral learning (SL) and expectation maximization (EM), which reveals an important gap between the theory and the practice. First, SL often leads to negative probabilities. Second, EM often yields better estimates than spectral learning and it does not seem to get stuck in local optima. We discuss how the rank of the model parameters and the amount of training data can yield negative probabilities. We also question the common belief that maximum likelihood estimators are necessarily inconsistent.
Pre-prints
|
Group Fairness Meets the Black Box: Enabling Fair Algorithms on Closed LLMs via
Post-Processing
R. Xian, Y. Wan, H. Zhao arXiv preprint [abs] [pdf] Instruction fine-tuned large language models (LLMs) enable a simple zero-shot or few-shot prompting paradigm, also known as in-context learning, for building prediction models. This convenience, combined with continued advances in LLM capability, has the potential to drive their adoption across a broad range of domains, including high-stakes applications where group fairness -- preventing disparate impacts across demographic groups -- is essential. The majority of existing approaches to enforcing group fairness on LLM-based classifiers rely on traditional fair algorithms applied via model fine-tuning or head-tuning on final-layer embeddings, but they are no longer applicable to closed-weight LLMs under the in-context learning setting, which include some of the most capable commercial models today, such as GPT-4, Gemini, and Claude. In this paper, we propose a framework for deriving fair classifiers from closed-weight LLMs via prompting: the LLM is treated as a feature extractor, and features are elicited from its probabilistic predictions (e.g., token log probabilities) using prompts strategically designed for the specified fairness criterion to obtain sufficient statistics for fair classification; a fair algorithm is then applied to these features to train a lightweight fair classifier in a post-hoc manner. Experiments on five datasets, including three tabular ones, demonstrate strong accuracy-fairness tradeoffs for the classifiers derived by our framework from both open-weight and closed-weight LLMs; in particular, our framework is data-efficient and outperforms fair classifiers trained on LLM embeddings (i.e., head-tuning) or from scratch on raw tabular features. |
|
Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data
H. Si, Y. Wan, M. Do, D. Vasisht, H. Zhao, H. F. Hamann arXiv preprint [abs] [pdf] [project page] Geospatial raster (imagery) data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels (e.g., spectral bands, polarizations) and sensing modalities. Recent work has adapted existing self-supervised learning approaches for such geospatial data. However, they fall short of scalable model architectures, leading to inflexibility and computational inefficiencies when faced with an increasing number of channels and modalities. To address these limitations, we introduce Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT) with three key innovations: i) the LESS Attention Block that approximates high-dimensional spatial-spectral attention through Kronecker’s product of the low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer that preserves both the continuity and physical characteristics of each spatial-spectral patch; and iii) the Perception Field Mask that exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct GFM-Bench, which serves as a comprehensive benchmark for such geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models while outperforming them on cross-satellite generalization tasks with higher computational efficiency. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels. Code and project page available at https://uiuctml.github.io/GeospatialFM/. |
|
Gradient-Based Multi-Objective Deep Learning:
Algorithms, Theories, Applications, and Beyond
W. Chen, X. Zhang, B. Lin, X. Lin, H. Zhao, Q. Zhang, J. T. Kwok arXiv preprint [abs] [pdf] [github] Multi-objective optimization (MOO) in deep learning aims to simultaneously optimize multiple conflicting objectives, a challenge frequently encountered in areas like multi-task learning and multi-criteria learning. Recent advancements in gradient-based MOO methods have enabled the discovery of diverse types of solutions, ranging from a single balanced solution to finite or even infinite Pareto sets, tailored to user needs. These developments have broad applications across domains such as reinforcement learning, computer vision, recommendation systems, and large language models. This survey provides the first comprehensive review of gradient-based MOO in deep learning, covering algorithms, theories, and practical applications. By unifying various approaches and identifying critical challenges, it serves as a foundational resource for driving innovation in this evolving field. A comprehensive list of MOO algorithms in deep learning is available at https://github.com/Baijiong-Lin/Awesome-Multi-Objective-Deep-Learning. |
|
Efficient Model Editing with Task Vector Bases: A Theoretical Framework and Scalable
Approach
S. Zeng, Y. He, W. You, Y. Hao, Y. H. Tsai, M. Yamada, H. Zhao arXiv preprint [abs] [pdf] Task vectors, which are derived from the difference between pre-trained and fine-tuned model weights, enable flexible task adaptation and model merging through arithmetic operations such as addition and negation. However, existing approaches often rely on heuristics with limited theoretical support, often leading to performance gaps comparing to direct task fine tuning. Meanwhile, although it is easy to manipulate saved task vectors with arithmetic for different purposes, such compositional flexibility demands high memory usage, especially when dealing with a huge number of tasks, limiting scalability. This work addresses these issues with a theoretically grounded framework that explains task vector arithmetic and introduces the task vector bases framework. Building upon existing task arithmetic literature, our method significantly reduces the memory cost for downstream arithmetic with little effort, while achieving competitive performance and maintaining compositional advantage, providing a practical solution for large-scale task arithmetic. |
|
Invariant-Feature Subspace Recovery: A New Class of Provable Domain Generalization
Algorithms
H. Wang, G. Balasubramaniam, H. Si, B. Li, H. Zhao arXiv preprint [abs] [pdf] Domain generalization asks for models trained over a set of training environments to generalize well in unseen test environments. Recently, a series of algorithms such as Invariant Risk Minimization (IRM) have been proposed for domain generalization. However, \citet{risks-of-IRM} shows that in a simple linear data model, even if non-convexity issues are ignored, IRM and its extensions cannot generalize to unseen environments with less than $d_s\mathrm{+}1$ training environments, where $d_s$ is the dimension of the spurious-feature subspace. In this work, we propose \textbf{I}nvariant-feature \textbf{S}ubspace \textbf{R}ecovery (ISR): a new class of algorithms to achieve provable domain generalization across the settings of classification and regression problems. First, in the binary classification setup of \citet{risks-of-IRM}, we show that our first algorithm, \textbf{ISR-Mean}, can identify the subspace spanned by invariant features from the first-order moments of the class-conditional distributions, and achieve provable domain generalization with $d_s\mathrm{+}1$ training environments. Our second algorithm, \textbf{ISR-Cov}, further reduces the required number of training environments to $\cO(1)$ using the information of second-order moments. Notably, unlike IRM, our algorithms bypass non-convexity issues and enjoy global convergence guarantees. Next, we extend ISR-Mean to the more general setting of multi-class classification and propose \textbf{ISR-Multiclass}, which leverages class information and provably recovers the invariant-feature subspace with $\lceil d_s/k \rceil + 1$ training environments for $k$-class classification. Finally, for regression problems, we propose \textbf{ISR-Regression} that can identify the invariant-feature subspace with $d_s + 1$ training environments. Empirically, we demonstrate the superior performance of our ISRs compared with IRM on synthetic benchmarks. Furthermore, ISRs can be used as simple yet effective post-processing methods for any given black-box feature extractors such as neural nets, and we show they can improve the worst-case accuracy of (pre-)trained models against spurious correlations and group shifts over multiple real-world datasets. |
|
Online Mirror Descent for Tchebycheff Scalarization in Multi-Objective
Optimization
M. Liu, X. Zhang, C. Xie, K. Donahue, H. Zhao arXiv preprint [abs] [pdf] [slides] The goal of multi-objective optimization (MOO) is to learn under multiple, potentially conflicting, objectives. One widely used technique to tackle MOO is through linear scalarization, where one fixed preference vector is used to combine the objectives into a single scalar value for optimization. However, recent work (Hu et al., 2024) has shown linear scalarization often fails to capture the non-convex regions of the Pareto Front, failing to recover the complete set of Pareto optimal solutions. In light of the above limitations, this paper focuses on Tchebycheff scalarization that optimizes for the worst-case objective. In particular, we propose an online mirror descent algorithm for Tchebycheff scalarization, which we call OMD-TCH. We show that OMD-TCH enjoys a convergence rate of O(\sqrt{\log m / T}) where m is the number of objectives and T is the number of iteration rounds. We also propose a novel adaptive online-to-batch conversion scheme that significantly improves the practical performance of OMD-TCH while maintaining the same convergence guarantees. We demonstrate the effectiveness of OMD-TCH and the adaptive conversion scheme on both synthetic problems and federated learning tasks under fairness constraints, showing state-of-the-art performance. |
People
Current (by alphabetical order)
| Weixin Chen (PhD in CS) |
| Yuen Chen (PhD in CS, co-advised with Hari Sundaram) |
| Xiwei Cheng (PhD in CS, co-advised with Ilan Shomorony) |
| Yifei He (PhD in CS) |
| Yuzheng Hu (PhD in CS) |
| Meitong Liu (PhD in CS) |
| Haozhe Si (PhD in ECE) |
| Siqi (Cindy) Zeng (PhD in CS, Anthropics Fellows) |
| Yuxuan Wan (UIUC MSCS) |
Alumni
| Seiyun Shin (PhD in ECE, co-advised with Ilan Shomorony, Mavis Future Faculty Fellows) |
| Ruicheng Xian (PhD in CS -> OpenAI) |
| Haoxiang Wang (PhD in ECE, co-advised with Bo Li, Mavis Future Faculty Fellows -> Nvidia) |
| Gargi Balasubramaniam (MSCS @ UIUC, Siebel Scholar -> Google DeepMind) |
| Ashutosh Sharma (MSCS, Siebel Scholar -> MIT-IBM Watson AI Lab) |
| Aditya Sinha (MSCS @ UIUC -> Netflix) |
| Qilong Wu (MSCS @ UIUC -> PhD in CS @ UIUC) |
| Sixian Du (undergrad in CS @ PKU -> Stanford MSEE) |
| Peiyuan (Alex) Liao (undergrad in CS @ CMU -> PhD in CS @ Stanford) |
| (Brian) Bo Li (undergrad in CS @ Harbin Institute of Technology -> PhD in CS @ Nanyang Technological University) |
| Samuel Schapiro (UIUC CS undergrad -> startup) |
| Jingyan Shen (Pinterest -> PhD in CS @ New York University) |
Teaching
| Term | Course | Location | Time |
|---|---|---|---|
| Fall 2025 | CS 598 - Foundations of Data Science | Siebel Center 1302 | TR 12:30PM - 1:45PM |
| Spring 2025 | CS 442 - Trustworthy Machine Learning | Siebel Center 1302 | TR 2PM - 3:15PM |
| Spring 2024 | CS 446 - Machine Learning | 1320 Digital Computer Laboratory | TR 12:30PM - 1:45PM |
| Fall 2023 | CS 442 - Trustworthy Machine Learning | 1310 Digital Computer Laboratory | WF 12:30PM - 1:45PM |
| Spring 2023 | CS 598: Transfer Learning | Siebel Center 0216 | WF 12:30PM - 1:45PM |
| Fall 2022 | CS 498 ML - Trustworthy Machine Learning | 4025 Campus Instructional Facility | TR 2PM - 3:15PM |
| Spring 2022 | CS 442 - Trustworthy Machine Learning | Siebel Center 1109 | WF 3:30PM - 4:45PM |
| Fall 2021 | CS 598 - Special Topics: Transfer Learning | Siebel Center 0216 | WF 2PM - 3:15PM |
Misc
| I enjoy sketching and calligraphy at my spare time. If I have a long vacation, I also enjoy traveling. My math genealogy can be found here. |