Johnny Tian-Zheng Wei | 魏天正
Hi! I am a PhD candidate (since fall 2019) at the University of Southern California, where I am advised by Robin Jia. My current interest is in the legal issues of AI, and my work often draws on basic concepts from inferential statistics. Previously, I earned a B.S. in Mathematics at the University of Massachusetts Amherst. During slow weeks at school, you might find me lifting (myself), swimming, biking, reading, writing, or learning. All that I do, I want to do with love.
Check out our workshop on LLM memorization!
Email: jtwei [strudel] usc.edu
Links: [Twitter] [Github] [Google Scholar]
Selected publications
A full list of my publications can be found on Google Scholar.
Hubble: a model suite to advance the study of LLM memorization
Johnny Tian-Zheng Wei*, Ameya Godbole*, Mohammad Aflah Khan*, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, and Robin Jia.
In submission.
[abstract] [bib] [github] [website]
We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models -- standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens -- establishing that memorization risks are determined by the frequency of sensitive data relative to the size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.
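As one illustration of the kind of analysis Hubble is meant to enable, below is a minimal sketch (not from the paper) of a verbatim memorization probe: prompt a model with the prefix of an inserted passage and check whether greedy decoding reproduces the continuation. The generate function is a hypothetical wrapper around a model's greedy decoding, and whitespace tokenization is a simplification.

```python
def verbatim_memorized(passage: str, generate, prefix_words: int = 32) -> bool:
    """Crude verbatim-memorization probe (illustrative only).

    `generate(prompt, max_new_tokens)` is a hypothetical greedy-decoding
    wrapper; splitting on whitespace stands in for real tokenization.
    """
    words = passage.split()
    if len(words) <= prefix_words:
        raise ValueError("passage too short to split into prefix and continuation")
    prompt = " ".join(words[:prefix_words])
    continuation = " ".join(words[prefix_words:])
    completion = generate(prompt, max_new_tokens=2 * (len(words) - prefix_words))
    # Count the passage as memorized if the decoded text begins with the
    # held-out continuation (compare a bounded prefix to avoid length quirks).
    return completion.strip().startswith(continuation[:200])
```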
Interrogating LLM design under a fair learning doctrine
Johnny Tian-Zheng Wei*, Maggie Wang*, Ameya Godbole, Jonathan H. Choi, and Robin Jia.
FAccT 2025.
[abstract] [bib]
The current discourse on large language models (LLMs) and copyright largely takes a "behavioral" perspective, focusing on model outputs and evaluating whether they are substantially similar to training data. However, substantial similarity is difficult to define algorithmically and a narrow focus on model outputs is insufficient to address all copyright risks. In this interdisciplinary work, we take a complementary "structural" perspective and shift our focus to how LLMs are trained. We operationalize a notion of "fair learning" by measuring whether any training decision substantially affected the model's memorization. As a case study, we deconstruct Pythia, an open-source LLM, and demonstrate the use of causal and correlational analyses to make factual determinations about Pythia's training decisions. By proposing a legal standard for fair learning and connecting memorization analyses to this standard, we identify how judges may advance the goals of copyright law through adjudication. Finally, we discuss how a fair learning standard might evolve to enhance its clarity by becoming more rule-like and incorporating external technical guidelines.
Proving membership in LLM pretraining data via data watermarks
Johnny Tian-Zheng Wei*, Ryan Wang*, and Robin Jia.
Findings of ACL 2024.
[abstract] [github] [bib]
Detecting whether copyright holders' works were used in LLM pretraining is poised to be an important problem. This work proposes using data watermarks to enable principled detection with only black-box model access, provided that the rightholder contributed multiple training documents and watermarked them before public release. By applying a randomly sampled data watermark, detection can be framed as hypothesis testing, which provides guarantees on the false detection rate. We study two watermarks: one that inserts random sequences, and another that randomly substitutes characters with Unicode lookalikes. We first show how three aspects of watermark design -- watermark length, number of duplications, and interference -- affect the power of the hypothesis test. Next, we study how a watermark's detection strength changes under model and dataset scaling: while increasing the dataset size decreases the strength of the watermark, watermarks remain strong if the model size also increases. Finally, we view SHA hashes as natural watermarks and show that we can robustly detect hashes from BLOOM-176B's training data, as long as they occurred at least 90 times. Together, our results point towards a promising future for data watermarks in real world use.
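To give a flavor of the hypothesis-testing framing, here is a minimal sketch (simplified, not the paper's exact procedure): score the published random-sequence watermark under the model and compare it against other watermarks the rights holder could have sampled, so the resulting p-value bounds the false detection rate. The score_under_model argument is a hypothetical stand-in for a black-box log-likelihood query.

```python
import random
import string

def random_watermark(length: int, rng: random.Random) -> str:
    """Sample a random character sequence of the kind a rights holder might
    embed in documents before public release."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def detection_p_value(published_watermark: str, score_under_model,
                      length: int, num_null_samples: int = 999,
                      seed: int = 0) -> float:
    """One-sided p-value for 'the model trained on the watermarked documents'.

    Under the null (no training contact), the published watermark is
    exchangeable with freshly sampled ones, so the rank-based p-value below
    is uniform and controls the false detection rate.
    `score_under_model` is a hypothetical black-box scorer (e.g., log-likelihood).
    """
    rng = random.Random(seed)
    observed = score_under_model(published_watermark)
    null_scores = [score_under_model(random_watermark(length, rng))
                   for _ in range(num_null_samples)]
    at_least_as_high = sum(s >= observed for s in null_scores)
    return (1 + at_least_as_high) / (1 + num_null_samples)
```

A small p-value means the model scores the published watermark unusually highly relative to watermarks that were never released.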
Operationalizing content moderation "accuracy" in the Digital Services Act
Johnny Tian-Zheng Wei, Frederike Zufall, and Robin Jia.
AIES 2024.
[abstract] [bib] [github]
The Digital Services Act, recently adopted by the EU, requires social media platforms to report the "accuracy" of their automated content moderation systems. The colloquial term is vague, or open-textured -- the literal accuracy (number of correct predictions divided by the total) is not suitable for problems with large class imbalance, and the ground truth and dataset to measure accuracy against is unspecified. Without further specification, the regulatory requirement allows for deficient reporting. In this interdisciplinary work, we operationalize "accuracy" reporting by refining legal concepts and relating them to technical implementation. We start by elucidating the legislative purpose of the Act to legally justify an interpretation of "accuracy" as precision and recall. These metrics remain informative in class imbalanced settings while also reflecting the proportional balancing of Fundamental Rights of the EU Charter. We then focus on the estimation of recall, as its naive estimation can incur extremely high annotation costs and disproportionately interfere with the platform's right to conduct business. Through a simulation study, we show that recall can be efficiently estimated using stratified sampling with trained classifiers, and provide concrete recommendations for its application. Finally, we present a case study of recall reporting for a subset of Reddit under the Act. Based on the language in the Act, we identify a number of ways recall could be reported due to underspecification. We report on one possibility using our improved estimator, and discuss the implications and need for legal clarification.
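As a rough illustration of the estimation problem, the sketch below stratifies items by an auxiliary classifier's score, sends a small sample from each stratum to annotation, and combines the per-stratum estimates into a ratio estimate of recall. It is a toy version under simplifying assumptions, not the estimator recommended in the paper; annotate stands in for the costly human labeling step.

```python
import random
from collections import defaultdict

def stratified_recall_estimate(items, annotate, num_strata: int = 5,
                               per_stratum: int = 50, seed: int = 0) -> float:
    """Toy stratified estimate of moderation recall (illustrative only).

    Each item is a dict with 'score' (auxiliary classifier score in [0, 1],
    used only to form strata) and 'flagged' (the platform's moderation
    decision). `annotate(item) -> bool` returns the gold label and is called
    only on sampled items, which is where the annotation cost lives.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[min(int(item["score"] * num_strata), num_strata - 1)].append(item)

    est_positives = 0.0          # estimated count of truly violating items
    est_flagged_positives = 0.0  # ... of those, estimated count the system flagged
    for members in strata.values():
        sample = rng.sample(members, min(per_stratum, len(members)))
        labels = [annotate(item) for item in sample]
        frac_pos = sum(labels) / len(sample)
        frac_flagged_pos = sum(lab and item["flagged"]
                               for lab, item in zip(labels, sample)) / len(sample)
        est_positives += frac_pos * len(members)
        est_flagged_positives += frac_flagged_pos * len(members)
    return est_flagged_positives / max(est_positives, 1e-12)
```

With heavy class imbalance, stratifying by classifier score keeps positives from being vanishingly rare in every annotated sample, which is where the savings over uniform sampling come from.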
Searching for a higher power in the human evaluation of MT
Johnny Tian-Zheng Wei, Tom Kocmi, and Christian Federmann.
WMT 2022.
[abstract] [bib] [github]
In MT evaluation, pairwise comparisons are conducted to identify the better system. In conducting the comparison, the experimenter must allocate a budget to collect Direct Assessment (DA) judgments. We provide a cost effective way to spend the budget, but show that typical budget sizes often do not allow for solid comparison. Taking the perspective that the basis of solid comparison is in achieving statistical significance, we study the power (rate of achieving significance) on a large collection of pairwise DA comparisons. Due to the nature of statistical estimation, power is low for differentiating less than 1-2 DA points, and to achieve a notable increase in power requires at least 2-3x more samples. Applying variance reduction alone will not yield these gains, so we must face the reality of undetectable differences and spending increases. In this context, we propose interim testing, an “early stopping” collection procedure that yields more power per judgment collected, which adaptively focuses the budget on pairs that are borderline significant. Interim testing can achieve up to a 27% efficiency gain when spending 3x the current budget, or 18% savings at the current evaluation power.
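The sketch below illustrates the interim-testing idea under simplifying assumptions: DA judgments arrive in batches, a two-sample z-test is run after each batch, and collection stops early once the difference clears a per-look threshold. The Bonferroni-style split of alpha across looks is a crude stand-in for a proper alpha-spending rule, and collect_batch is a hypothetical data-collection hook.

```python
import math
import statistics

def interim_compare(collect_batch, max_looks: int = 5, alpha: float = 0.05) -> dict:
    """Toy interim-testing loop for a pairwise system comparison.

    `collect_batch(k)` is a hypothetical hook returning (scores_a, scores_b),
    one more batch of DA judgments per system (each batch should contain
    several judgments). After every batch ("look") a two-sample z-test is run;
    collection stops as soon as the difference is significant at
    alpha / max_looks, a crude Bonferroni stand-in for real alpha spending.
    """
    a_all, b_all = [], []
    per_look_alpha = alpha / max_looks
    for look in range(1, max_looks + 1):
        a, b = collect_batch(look)
        a_all.extend(a)
        b_all.extend(b)
        se = math.sqrt(statistics.variance(a_all) / len(a_all)
                       + statistics.variance(b_all) / len(b_all))
        z = (statistics.mean(a_all) - statistics.mean(b_all)) / se
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        if p < per_look_alpha:
            return {"look": look, "z": z, "p": p, "significant": True}
    return {"look": max_looks, "z": z, "p": p, "significant": False}
```

Stopping early on clear-cut pairs is what frees budget for the borderline ones.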
The statistical advantage of automatic NLG metrics at the system level
Johnny Tian-Zheng Wei and Robin Jia.
ACL 2021.
[abstract] [bib] [github]
Estimating the expected output quality of generation systems is central to NLG. This paper qualifies the notion that automatic metrics are not as good as humans in estimating system-level quality. Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Measuring this error is complicated: predictions are evaluated against noisy, human predicted labels instead of the ground truth, and metric predictions fluctuate based on the test sets they were calculated on. By applying a bias-variance-noise decomposition, we adjust this error to a noise-free, infinite test set setting. Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected. In MT, we identify two settings where metrics outperform humans due to a statistical advantage in variance: when the number of human judgments used is small, and when the quality difference between compared systems is small.
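To illustrate the bootstrap comparison at the system level, the toy sketch below resamples segments and counts how often the predicted winner flips relative to the full-data estimate. It does not reproduce the paper's bias-variance-noise adjustment; the full-data difference simply stands in for a reference answer, and the same function can be run on segment-level metric scores or human judgments to compare the two estimators' stability.

```python
import random
import statistics

def winner_flip_rate(seg_scores_a, seg_scores_b, n_boot: int = 1000,
                     seed: int = 0) -> float:
    """Toy bootstrap check of a system-level estimator's stability.

    `seg_scores_a` and `seg_scores_b` are parallel per-segment scores (from a
    metric or from human judgments) for two systems on the same test set.
    Returns the fraction of bootstrap resamples in which the sign of
    mean(a) - mean(b) disagrees with the sign computed on the full test set.
    """
    rng = random.Random(seed)
    n = len(seg_scores_a)
    full_delta = statistics.mean(seg_scores_a) - statistics.mean(seg_scores_b)
    flips = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = (statistics.mean(seg_scores_a[i] for i in idx)
                 - statistics.mean(seg_scores_b[i] for i in idx))
        flips += (delta * full_delta) < 0
    return flips / n_boot
```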
Policy
A Strategic Zero Draft: Standards for LLM Memorization
Johnny Tian-Zheng Wei and Robin Jia.
[call]
Miscellany
- Two books have greatly influenced my perspective on research: Geek Heresy by Kentaro Toyama and The Worlds I See by Li Fei-Fei. Thank you both for sharing your stories.
- I recommend How to Avoid a Climate Disaster for everyone, especially if you're an undergraduate or younger.
- This webpage was adapted from Nelson's. Thanks!
