Xiaoxia Wu
Email: shirley AT Together dot ai
About me
I am currently a Senior Staff Scientist at Together AI (since July 2024), where I build tooling and work on quantization. Ping me if you're interested in building systems that make inference fast!
Previously, I was a Senior Researcher at Microsoft GenAI, where we developed the Phi-3 family of models. I was fortunate to be a member of the DeepSpeed team, led by Zhewei Yao and Yuxiong He, and worked closely with Weizhu's team. At DeepSpeed, I focused on system- and algorithm-level optimizations for large-scale training and inference of LLMs, with a particular emphasis on compression, long sequences, and multi-modal research. Some of my projects include DeepSpeed-FP6 and DeepSpeed-Chat. For more information, please check deepspeed.ai.
I was a postdoctoral research fellow mentored by Rebecca Willett at the University of Chicago and the Toyota Technological Institute at Chicago. I completed my Ph.D. at The University of Texas at Austin, where I was fortunate to be advised by Rachel Ward and informally co-advised by Léon Bottou. My Ph.D. research focused on optimization methods that are efficient and robust to hyperparameter tuning, such as adaptive gradient descent and batch normalization. I was a research intern at Facebook AI Research (New York office) during Fall 2017, and a research intern at Google working with Ethan Dyer and Behnam Neyshabur during Summer 2020.
I hold an M.Sc. with Distinction in Financial Mathematics from the University of Edinburgh. Before that, I spent a wonderful four years in the Department of Mathematics and Applied Mathematics at Shantou University, where I was awarded the Li Ka Shing Scholarship to participate in Semester at Sea. I am from Guangdong, China, and speak Cantonese and Hakka.
Papers and Preprints (updated Nov 2021)
- See my recent papers
- Adaptive Differentially Private Empirical Risk Minimization
Xiaoxia Wu, Lingxiao Wang, Irina Cristali, Quanquan Gu, Rebecca Willett
arXiv:2110.07435
- AdaLoss: A computationally-efficient and provably convergent adaptive gradient method
Xiaoxia Wu, Yuege Xie, Simon Du and Rachel Ward
arXiv:2109.08282
- Hierarchical Learning for Generation with Long Source Sequences
Tobias Rohde, Xiaoxia Wu, and Yinhan Liu
arXiv:2104.07545
- When Do Curricula Work?
Xiaoxia Wu, Ethan Dyer, and Behnam Neyshabur
ICLR (Oral, 53 papers accepted as oral out of 2997 submissions), 2021
[code, slides]
- Implicit Regularization and Convergence for Weight Normalization
Xiaoxia Wu*, Edgar Dobriban*, Tongzheng Ren*, Shanshan Wu*, Yuanzhi Li, Suriya Gunasekar, Rachel Ward and Qiang Liu
NeurIPS, 2020
[slides]
- Choosing the Sample with Lowest Loss makes SGD Robust
Vatsal Shah, Xiaoxia Wu, and Sujay Sanghavi
AISTATS, 2020
- Linear Convergence of Adaptive Stochastic Gradient Descent
Yuege Xie, Xiaoxia Wu, and Rachel Ward
AISTATS, 2020
- Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network
Xiaoxia Wu, Simon S. Du, and Rachel Ward
preprint, 2019
- AdaGrad stepsizes: Sharp convergence over nonconvex landscapes
Rachel Ward*, Xiaoxia Wu*, Léon Bottou
ICML (Oral), 2019
(The longer version is published in the Journal of Machine Learning Research.)
[code, 20-minute video and slides, 机器之心 (Synced) coverage]
- WNGrad: Learn the Learning Rate in Gradient Descent
Xiaoxia Wu*, Rachel Ward*, Léon Bottou
preprint, 2018
- An Optimal Mortgage Refinancing Strategy with Stochastic Interest Rate
Xiaoxia Wu, Dejun Xie, David A Edwards
Computational Economics, 1-23, 2018
- Value-at-Risk estimation with stochastic interest rate models for option-bond portfolios
Xiaoyu Wang, Dejun Xie, Jingjing Jiang, Xiaoxia Wu, Jia He
Finance Research Letters 21 (2017): 10-20
*: equal contribution.
Teaching Assistant at UT Austin
- Probability I, Spring 19
- Scientific Computation in Numerical Analysis, Spring 18
- Linear Algebra and Matrix Theory, Spring 17
- Sequences, Series, and Multivariate Calculus, Spring 16, Fall 16
- Differential and Integral Calculus, Fall 14, Spring 15, Fall 16