Casey Meehan
Graduating in 2023! I'm a fifth-year PhD candidate at UCSD advised by Kamalika Chaudhuri. My research focuses on personal data privacy in machine learning from two angles: 1) understanding and quantifying how large models memorize their training data, which can leak individuals' sensitive information, and 2) taking an application-specific approach to offering provable privacy in different ML domains.
Projects
- Casey Meehan, Florian Bordes, Pascal Vincent, Kamalika Chaudhuri, Chuan Guo. "Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-Supervised Learning" Preprint, 2023
SSL models have been shown to learn remarkably useful representations of images without access to any labels. However, it was unknown whether SSL models memorize their training images and if so, how. To answer this open question, we identified a new variety of data memorization, Déjà Vu. Given a small generic crop of an SSL training image (e.g. a patch of rippling water), we are able to generatively reconstruct the remainder of the image (e.g. the black swan floating next to that very patch of water) using a novel diffusion-based extraction method. To quantify the degree of Déjà Vu occurring in SSL models, we propose a variety of numerical tests that distinguish unwanted memorization behavior from expected correlation behavior. Given the jarring privacy implications of this finding, we propose potential mitigation strategies. LinkedIn Post
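  For intuition, here is a minimal synthetic sketch of the kind of memorization-vs-correlation comparison such tests build on: a KNN in the target model's embedding space recovers the foreground label from a background-only crop far better than a reference model trained on disjoint data does. Everything below uses synthetic stand-in embeddings, not the paper's actual pipeline.

  ```python
  # Illustrative sketch of a memorization-vs-correlation test in the spirit of
  # the quantitative Deja Vu tests. All data here is synthetic; the real
  # pipeline uses SSL embeddings of image crops and a KNN over a public set
  # disjoint from training.
  import numpy as np

  rng = np.random.default_rng(0)

  def knn_label_accuracy(crop_emb, public_emb, public_labels, true_labels, k=5):
      """Predict each crop's foreground label by majority vote over its k
      nearest public-set neighbors in embedding space; return the accuracy."""
      correct = 0
      for e, y in zip(crop_emb, true_labels):
          dists = np.linalg.norm(public_emb - e, axis=1)
          nn = public_labels[np.argsort(dists)[:k]]
          pred = np.bincount(nn).argmax()
          correct += int(pred == y)
      return correct / len(true_labels)

  # Synthetic stand-ins: embeddings of background-only crops under a model
  # trained on these images ("target") vs. one trained on disjoint data ("reference").
  n, d, n_classes = 200, 32, 10
  labels = rng.integers(0, n_classes, size=n)
  public_emb = rng.normal(size=(n, d)) + labels[:, None]    # hypothetical labeled public set
  target_emb = public_emb + 0.3 * rng.normal(size=(n, d))   # "memorized": crop embedding encodes the label
  reference_emb = rng.normal(size=(n, d))                   # "correlation only": no label signal

  acc_target = knn_label_accuracy(target_emb, public_emb, labels, labels)
  acc_reference = knn_label_accuracy(reference_emb, public_emb, labels, labels)
  print(f"target model accuracy:    {acc_target:.2f}")
  print(f"reference model accuracy: {acc_reference:.2f}")
  print("gap (large gap suggests memorization):", round(acc_target - acc_reference, 2))
  ```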
- Casey Meehan, Khalil Mrini, Kamalika Chaudhuri. "Sentence-level Privacy for Document Embeddings" ACL, 2022
User language data can contain highly sensitive personal content. As such, it is imperative to offer users a strong and interpretable privacy guarantee when learning from their data. In this work, we propose SentDP: pure local differential privacy at the sentence level for a single user document. We propose a novel technique, DeepCandidate, that combines concepts from robust statistics and language modeling to produce high-dimensional, general-purpose $\epsilon$-SentDP LLM document embeddings. This guarantees that any single sentence in a document can be substituted with any other sentence while keeping the embedding $\epsilon$-indistinguishable.
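  To make the guarantee concrete, here is a minimal sketch of a sentence-level $\epsilon$-DP document embedding built from a clipped mean of per-sentence vectors plus Laplace noise. This baseline only illustrates the SentDP definition; it is not the DeepCandidate mechanism, and the dimensions and clipping radius are arbitrary choices.

  ```python
  # Minimal sketch of a sentence-level epsilon-DP document embedding: a clipped
  # mean of per-sentence embeddings plus Laplace noise. Illustrates the SentDP
  # guarantee only; NOT the paper's DeepCandidate mechanism.
  import numpy as np

  rng = np.random.default_rng(0)

  def sentdp_embed(sentence_embs, epsilon, clip=1.0):
      """Release a document embedding such that swapping any ONE sentence changes
      the output distribution by at most a factor exp(epsilon)."""
      n, d = sentence_embs.shape
      # Clip each sentence embedding to L1 norm <= clip.
      norms = np.maximum(np.abs(sentence_embs).sum(axis=1, keepdims=True), 1e-12)
      clipped = sentence_embs * np.minimum(1.0, clip / norms)
      mean = clipped.mean(axis=0)
      # Replacing one sentence moves the mean by at most 2*clip/n in L1,
      # so Laplace noise with scale (2*clip/n)/epsilon gives epsilon-DP.
      scale = (2.0 * clip / n) / epsilon
      return mean + rng.laplace(scale=scale, size=d)

  # Hypothetical document: 20 sentences embedded in 64 dimensions.
  doc = rng.normal(size=(20, 64))
  private_emb = sentdp_embed(doc, epsilon=5.0)
  print(private_emb[:5])
  ```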
- Casey Meehan, Kamalika Chaudhuri, Sanjoy Dasgupta. "A Non-Parametric Test to Detect Data-Copying in Generative Models" AISTATS, 2020
It is not clear how to determine whether a generative model is overfitting its dataset. This problem is exacerbated by the fact that contemporary generative models have intractable likelihoods. In this project, we propose a new notion of overfitting, data-copying, wherein a generative model produces examples that are closer to its training set than a held-out test set would be. See the companion blogpost here.
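  A simplified, global version of the idea can be sketched as a three-sample check: compare generated samples' distances to the training set against a held-out test set's distances with a Mann-Whitney U test. The paper's statistic additionally partitions the space into cells, so treat this as a cartoon of the core idea on synthetic Gaussians.

  ```python
  # Simplified, global data-copying check: are generated samples systematically
  # CLOSER to the training set than an independent test set is? (The full test
  # in the paper works cell-by-cell over a partition of the space.)
  import numpy as np
  from scipy.stats import mannwhitneyu

  rng = np.random.default_rng(0)

  def dist_to_set(points, ref):
      """L2 distance from each point to its nearest neighbor in ref."""
      return np.array([np.linalg.norm(ref - p, axis=1).min() for p in points])

  train = rng.normal(size=(500, 2))
  test = rng.normal(size=(200, 2))

  # A "data-copying" generator: training points plus tiny jitter.
  copied = train[rng.integers(0, 500, size=200)] + 0.01 * rng.normal(size=(200, 2))
  # An honest generator: fresh samples from the true distribution.
  honest = rng.normal(size=(200, 2))

  for name, gen in [("copying", copied), ("honest", honest)]:
      d_gen = dist_to_set(gen, train)
      d_test = dist_to_set(test, train)
      # One-sided test: generated distances stochastically smaller than test distances?
      stat, p = mannwhitneyu(d_gen, d_test, alternative="less")
      print(f"{name:8s} generator: p-value = {p:.3g}  (small p => evidence of data-copying)")
  ```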
- Casey Meehan, Amrita Roy Chowdhury, Kamalika Chaudhuri, Somesh Jha. "Privacy Implications of Shuffling" ICLR, 2022
Here we formalize how non-uniform random shuffling of users' private data can provide a strong notion of inferential privacy (preventing inference about your data using others' data) while still preserving broad trends within the aggregate. For example, shuffling can block attacks that leverage your family's medical data to make inferences about your medical data.
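  As a toy illustration only (uniform shuffling, synthetic values), the sketch below shows how shuffling breaks the link between users and their released values while leaving the aggregate untouched; the paper's analysis concerns non-uniform shuffling and quantifies the resulting inferential guarantee.

  ```python
  # Toy illustration of shuffling as a privacy mechanism: sensitive values are
  # permuted before release, so the released table no longer links values to
  # users, yet the aggregate (the mean) is exactly preserved. This is a uniform
  # shuffle cartoon, not the paper's non-uniform mechanism or its analysis.
  import numpy as np

  rng = np.random.default_rng(0)

  users = [f"user_{i}" for i in range(6)]
  values = rng.integers(100, 200, size=6)   # hypothetical sensitive values

  perm = rng.permutation(len(values))       # uniform shuffle for illustration
  released = values[perm]

  print("true pairs:    ", list(zip(users, values)))
  print("released pairs:", list(zip(users, released)))
  print("aggregate mean preserved:", values.mean() == released.mean())
  ```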
- Tatsuki Koga, Casey Meehan, Kamalika Chaudhuri. "Privacy Amplification by Subsampling in Time Domain" AISTATS, 2022
Here we show how subsampling in the time domain -- a signal processing primitive -- provides privacy amplification, reducing the privacy cost while still preserving underlying time-series trends. Using a novel analysis, we show the significant reduction in sensitivity provided by time-domain subsampling and propose a corresponding new class of privacy mechanisms.
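  The sketch below is only a cartoon of the amplification intuition, not the paper's sensitivity analysis: keep every k-th reading at a random phase and add Laplace noise, so any single reading is released with probability 1/k while the overall trend survives. The series, noise scale, and parameters are all illustrative assumptions.

  ```python
  # Cartoon of privacy amplification by time-domain subsampling: pick a random
  # phase, keep every k-th reading, and add Laplace noise to the retained points.
  # Any single reading is released with probability 1/k, which is the informal
  # source of the amplification; the paper's analysis of time-series queries is
  # more careful, so treat this as intuition only.
  import numpy as np

  rng = np.random.default_rng(0)

  def subsample_and_noise(series, k, epsilon, sensitivity=1.0):
      """Keep every k-th sample (random phase) and add Laplace(sensitivity/epsilon) noise."""
      phase = rng.integers(0, k)
      kept_idx = np.arange(phase, len(series), k)
      noise = rng.laplace(scale=sensitivity / epsilon, size=len(kept_idx))
      return kept_idx, series[kept_idx] + noise

  # Hypothetical hourly readings with a daily trend.
  t = np.arange(24 * 7)
  series = 10 + 3 * np.sin(2 * np.pi * t / 24) + rng.normal(scale=0.5, size=t.size)

  idx, released = subsample_and_noise(series, k=4, epsilon=1.0)
  print("released", len(released), "of", len(series), "readings")
  print("trend roughly preserved, correlation with truth:",
        round(np.corrcoef(series[idx], released)[0, 1], 2))
  ```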
- Casey Meehan, Kamalika Chaudhuri. "Location Trace Privacy Under Conditional Priors" AISTATS, 2021
In this project we analyze how to provide meaningful local privacy to sequences of individuals' locations when they are captured close together in time (traces). See corresponding blogpost here.