<p><em>Jessica Liu’s Personal Blog 🦭 · Writing about my learning journey.</em></p>
<h1 id="an-overview-of-neurips-2023-best-papers">An Overview of NeurIPS 2023 Best Papers</h1>
<p><em>2023-12-22 · https://jesscel.github.io/paper_reading/2023/12/22/neurips-2023-best-papers-overview</em></p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-12-22-neurips-2023-best-papers-overview/cover.png" alt="A llama, a gopher, and a chinchilla attending NeurIPS - image generated by DALL-E" /></p>
<p>It has always been challenging for me to follow the latest trends in fast-growing fields like foundation models, especially when it comes to reading the latest papers. This year, thanks to Jerry Liu’s NeurIPS reading list, it became much easier for me to keep up with what was happening at NeurIPS.</p>
<p>In this reading note, I try to capture the most important takeaways from each of the four NeurIPS best papers, in the hope that it helps people navigate the relevant fields more easily. I recommend reading the original paper if you find any of them particularly intriguing.</p>
<p>For each paper, I first summarize it in one sentence that captures the most important message I took away from it. Then, I provide a deeper dive into the sub-arguments, components, or evidence that I found important while reading.</p>
<h1 id="are-emergent-abilities-of-large-language-models-amirage">Are Emergent Abilities of Large Language Models a Mirage?</h1>
<h2 id="one-sentence-takeaway">One-Sentence Takeaway</h2>
<p>Emergent abilities of LLMs claimed in prior work may be caused by the researchers’ choice of metrics, specifically the use of metrics that scale nonlinearly or discontinuously with the model’s per-token error rate, rather than by something inherent in the task or model family.</p>
<h2 id="closer-look">Closer Look</h2>
<p>Emergent abilities in LLMs are defined in this paper as having two properties:</p>
<ul>
<li>Sharpness - the transition from not present to present is instantaneous</li>
<li>Unpredictability - it’s hard to foresee the model scales at which these abilities appear</li>
</ul>
<p>The main argument of the paper is that the emergent abilities of LLMs are the result of the researchers’ choice of metrics. Specifically, the paper makes the following major predictions regarding the appearance of emergent abilities:</p>
<ul>
<li>While nonlinear and discontinuous metrics lead to apparent emergent abilities, switching to linear and continuous metrics on the same model outputs yields smooth, continuous, and predictable scaling curves (illustrated in the toy sketch after this list).</li>
<li>Emergent abilities may also be caused by insufficient resolution in the test data, and they disappear when we increase the resolution by generating more test data.</li>
<li>Emergent abilities only appear under a few specific metrics (e.g. Multiple Choice Grade and Exact String Match), regardless of the task and model families.</li>
</ul>
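<p>To make the metric effect concrete, here is a toy simulation of my own (not from the paper, and all numbers are made up): per-token accuracy improves smoothly and predictably with model size, but scoring the same outputs with exact string match over a 30-token target produces what looks like a sudden jump.</p>
<pre><code class="language-python">import numpy as np

# Toy illustration (my own numbers, not the paper's): a smooth power-law
# improvement in per-token accuracy looks "emergent" once it is scored with
# a discontinuous metric such as exact string match over a whole sequence.
model_sizes = np.logspace(7, 11, 9)                     # 1e7 .. 1e11 parameters
per_token_acc = 1 - 0.5 * (model_sizes / 1e7) ** -0.3   # smooth and predictable

seq_len = 30                                            # target output length
exact_match = per_token_acc ** seq_len                  # every token must be right

for n, acc, em in zip(model_sizes, per_token_acc, exact_match):
    print(f"{n:14,.0f} params | per-token acc {acc:.3f} | exact match {em:.4f}")
</code></pre>
<p>The per-token accuracy column changes gradually, while the exact-match column stays near zero and then shoots up over the last couple of model sizes - the same outputs, a different metric.</p>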
<p>In support of its main argument, the paper also shows that the authors can induce emergent abilities in vision models by changing the evaluation metric. They picked vision models because emergent abilities haven’t been observed in this class of models. Specifically, they induced an emergent reconstruction ability in shallow nonlinear autoencoders by swapping the mean squared reconstruction error for a nonlinear reconstruction metric.</p>
<p>Image from the paper</p>
<h1 id="scaling-data-constrained-languagemodels">Scaling Data-Constrained Language Models</h1>
<h2 id="one-sentence-takeaway-1">One-Sentence Takeaway</h2>
<p>When the amount of unique data is constrained, it is beneficial to train the model for multiple epochs on repeated data, albeit with exponentially decaying returns.</p>
<h2 id="closer-look-1">Closer Look</h2>
<p>Scaling laws are how researchers try to make scaling LLMs more predictable. This paper focuses on scaling language models under data-constrained conditions. In particular, it quantifies the impact of multi-epoch training compared with the single-epoch training on unique data recommended by prior work.</p>
<p>Image from the paper</p>
<p>The data-constrained scaling law states:</p>
<ul>
<li>(Allocation) Under the same data constraint, allocating most of the additional compute to more epochs rather than more parameters yields a larger reduction in loss.</li>
<li>(Return) Repeating data brings meaningful gains for up to around 4 to 8 epochs (see the figure), after which returns diminish predictably (a rough numerical sketch of this shape follows below).</li>
</ul>
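<p>The rough shape of the return is easy to picture with a small sketch. This is my own illustration of the “exponentially decaying value of repeated tokens” idea, not the paper’s fitted formula, and the decay constant below is made up:</p>
<pre><code class="language-python">import numpy as np

# Sketch of diminishing returns from repeating data: tokens seen again count
# as exponentially less "effective" data. r_star is an illustrative constant,
# not the value fitted in the paper.
def effective_tokens(unique_tokens, epochs, r_star=15.0):
    repetitions = epochs - 1                      # the first pass is all unique
    return unique_tokens * (1 + r_star * (1 - np.exp(-repetitions / r_star)))

unique = 100e9                                    # 100B unique tokens (made up)
for epochs in [1, 2, 4, 8, 16, 64]:
    eff = effective_tokens(unique, epochs)
    seen = unique * epochs
    print(f"{epochs:3d} epochs | {eff / 1e9:7.1f}B effective tokens "
          f"| {eff / seen:5.0%} of tokens seen")
</code></pre>
<p>Under this shape, going from 1 to around 4-8 epochs still buys a lot of effective data, while going to dozens of epochs buys very little per additional pass, matching the Return part of the law above.</p>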
<p>To support further scaling, the paper also investigates complementary strategies for addressing the data constraint. Specifically, it finds that augmenting the training data with code roughly doubles the amount of usable data, which means the potential to scale roughly 2x further.</p>
<h1 id="direct-preference-optimization-your-language-model-is-secretly-a-rewardmodel">Direct Preference Optimization: Your Language Model is Secretly a Reward Model</h1>
<h2 id="one-sentence-takeaway-2">One-Sentence Takeaway</h2>
<p>DPO is an LM fine-tuning method that directly optimizes for human preferences in a single step, eliminating the need for a separate reward-modeling step and an RL-based policy-learning step as in RLHF.</p>
<h2 id="closer-look-2">Closer Look</h2>
<p>LLMs like GPT-3.5 and GPT-4 have shown impressive success in following human instructions thanks to the reinforcement learning from human feedback (RLHF) method. The standard RLHF involves three steps: 1) supervised fine-tuning (SFT) on instruction data; 2) reward modeling; and 3) reinforcement learning using the reward model from step 2.</p>
<p>Image from the paper</p>
<p>However, the RL training step is computationally expensive, unstable, and complicated to implement. To get around this complexity, the paper proposes DPO, which matches and even exceeds the performance of RL-based methods using a simple binary cross-entropy objective.</p>
<p>The main idea of DPO is that we can directly optimize the LM - i.e. the policy - to follow human preferences, rather than explicitly training a reward model and then using RL to optimize the policy.</p>
<p>Specifically, DPO takes the same RL objective from the RLHF methods and uses its optimal solution to express the reward model in terms of only the optimal and reference policies, as shown below.</p>
<p>Equation from the paper</p>
<p>This reparameterization is then substituted into the RL objective to obtain the DPO objective:</p>
<p>Equation from the paper</p>
<p>This objective lets DPO train the (implicit) reward model and the policy together in a single step, bypassing the computationally expensive reward-modeling and RL training steps of RLHF methods.</p>
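<p>For readers who think better in code, here is a minimal sketch of the DPO objective (my own toy illustration; the variable names and numbers are made up, so see the paper’s reference implementation for the real thing). Given summed log-probabilities of the preferred (“chosen”) and dispreferred (“rejected”) responses under the policy and the frozen reference model, the loss is a binary cross-entropy on the difference of implicit rewards:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss: -log sigmoid(beta * reward margin)."""
    # Implicit reward of each response: beta * log(pi(y|x) / pi_ref(y|x)).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary cross-entropy: push the chosen reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up summed log-probabilities for a batch of 3 pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -15.0]),
                torch.tensor([-14.0, -9.0, -20.0]),
                torch.tensor([-13.0, -10.0, -16.0]),
                torch.tensor([-13.5, -9.5, -19.0]))
print(loss.item())
</code></pre>
<p>Everything needed is an ordinary supervised forward pass through the policy and the reference model - no sampling loop, no separate reward network, no PPO machinery.</p>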
<h1 id="decodingtrust-a-comprehensive-assessment-of-trustworthiness-in-gptmodels">DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models</h1>
<h2 id="one-sentence-takeaway-3">One-Sentence Takeaway</h2>
<p>The paper presents a comprehensive evaluation of trustworthiness in GPT-3.5 and GPT-4, detailing evaluation data, metrics, methods, and results, and suggesting that these GPT models still have trustworthiness vulnerabilities that need to be addressed.</p>
<h2 id="closer-look-3">Closer Look</h2>
<p>This paper is a thorough evaluation report on the trustworthiness of GPT-3.5 and GPT-4. In particular, it divides trustworthiness into 8 different criteria (detailed below). For each criterion, the paper provides detailed information on dataset construction, evaluation metrics, and results.
The 8 trustworthiness criteria and the main conclusions are described below:</p>
<p><strong>Toxicity</strong>
While GPT-3.5 and GPT-4 have much lower toxicity scores compared with previous models, they showed almost 100% toxicity probability when given adversarial system prompts.</p>
<p><strong>Stereotype bias</strong>
Both GPT-3.5 and GPT-4 show low agreeability - i.e. how often the model agrees with a stereotypical statement - when given untargeted prompts, but they show high agreeability when given targeted adversarial prompts, especially GPT-4.</p>
<p><strong>Adversarial robustness</strong>
GPT-4 is more robust than GPT-3.5, but both models are still vulnerable under adversarial texts generated by recent autoregressive models (from the proposed AdvGLUE++ dataset).</p>
<p><strong>Out-of-distribution robustness</strong>
GPT-4 is more robust than GPT-3.5 both when given texts with OOD style and when asked about OOD knowledge;
But both models are still vulnerable when given less common styles and still generate made-up responses when given OOD knowledge.</p>
<p><strong>Robustness on adversarial demonstrations</strong>
Both GPT-3.5 and GPT-4 benefit from counterfactual examples in demonstrations;
GPT-3.5 is more vulnerable to spurious correlations in demonstrations;
GPT-4 is more vulnerable to backdoor demonstrations.</p>
<p><strong>Privacy</strong>
GPT models can leak Personally Identifiable Information (PII) in training data and from prior conversations;
Both GPT-3.5 and GPT-4 leak almost everything when provided with privacy-leakage demonstrations under in-context learning.</p>
<p><strong>Machine ethics</strong>
Both GPT-3.5 and GPT-4 can be misled by jailbreaking prompts and evasive sentences;
GPT-4 is more vulnerable under jailbreaking prompts, potentially due to its better instruction-following abilities.</p>
<p><strong>Fairness</strong>
GPT-4 is more accurate under demographically balanced test data but demonstrates higher unfairness scores under demographically unbalanced test data compared to GPT-3.5;
The fairness of both GPT models can be improved by providing demographically balanced few-shot examples.</p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://arxiv.org/pdf/2304.15004.pdf">📃 Are Emergent Abilities of Large Language Models a Mirage?</a></li>
<li><a href="https://arxiv.org/pdf/2305.16264.pdf">📃 Scaling Data-Constrained Language Models</a></li>
<li><a href="https://arxiv.org/pdf/2305.18290.pdf">📃 Direct Preference Optimization: Your Language Model is Secretly a Reward Model</a></li>
<li><a href="https://arxiv.org/pdf/2306.11698.pdf">📃 DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models</a></li>
<li><a href="https://www.youtube.com/watch?v=TK0-sitkCMw">📺 Scaling Data-Constrained Language Models - Talk at IST-Unbabel Seminar</a></li>
<li><a href="https://www.youtube.com/watch?v=LkED9wKI1TY">📺 Sophia Yang’s NeurIPS Best Paper Deep Dive Video</a></li>
</ul>
<hr />
<p>I hope you enjoyed this article! Connect with me on LinkedIn or Twitter if you are also interested in AI, ML, LLMs, databases, and more.</p>
<h1 id="importance-sampling-explained-end-to-end">Importance Sampling Explained End-to-End</h1>
<p><em>2023-10-28 · https://jesscel.github.io/technical_concepts/2023/10/28/importance-sampling-explained</em></p>
<p>Importance sampling is a useful technique when it’s infeasible for us to sample from the real distribution p, when we want to reduce variance of the current Monte Carlo estimator, or when we only know p up to a multiplicative constant.</p>
<p>I found it confusing when I first learned about importance sampling because it applies to so many different scenarios and there are few resources online that walk us through the end-to-end derivation and reasoning process.</p>
<p>In this post, I will try my best to provide such an explanation, starting from Monte Carlo methods and ending with an example walk-through.</p>
<h1 id="monte-carlomethods">Monte Carlo Methods</h1>
<p>Calculating expectations is an important task in machine learning. Mathematically, given a random variable x with probability density p(x), the expectation of a function of interest f(x) is computed as:</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-1.jpg" alt="equation 1" /></p>
<p>At a high level, the expectation can be understood as the probability-weighted average of f(x) over the entire space where x lives. When x is discrete, it becomes:</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-2.jpg" alt="equation 2" /></p>
<p>However, when x has high dimension, its probability space becomes exponentially large, making it infeasible to directly calculate the expectation using the integral above. Monte Carlo Methods address this problem by sampling from the probability distribution of x and estimating the expectation from the samples using this formula:</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-3.jpg" alt="equation 3" /></p>
<p>It turns out that this estimate approaches the true expectation as N becomes large.</p>
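<p>As a minimal sketch of the plain Monte Carlo estimator (a toy example of my own, not from any particular source), take x drawn from a standard normal and f(x) = x^2, whose true expectation is 1:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)

# Toy setup: x ~ N(0, 1) and f(x) = x^2, so the true expectation E[f(x)] is 1.
def f(x):
    return x ** 2

# Plain Monte Carlo: draw N samples from p, then average f over the samples.
N = 10_000
samples = rng.standard_normal(N)
print(f(samples).mean())   # close to 1 for large N
</code></pre>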
<h1 id="bias-and-variance-of-an-estimator">Bias and Variance of an Estimator</h1>
<p>According to Wikipedia, “an estimator is a rule for calculating an estimate of a given quantity based on observed data”. Hence, the process of running the Monte Carlo method yields an estimator for the expectation.</p>
<p>It is important to distinguish between an estimator and a trial of the Monte Carlo method. We can think of an estimator as the aggregate result from running multiple trials of the Monte Carlo method. Due to the randomness in the sampling step, every time we run through the two-step process of sampling and estimating, we produce a slightly different estimation, which we call s, for the expectation. And the estimations themselves have a distribution, which has its own expectation and variance.</p>
<p>The Central Limit Theorem (CLT) states that</p>
<blockquote>
<p>“…for independent and identically distributed random variables, the sampling distribution of the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed” - Wikipedia</p>
</blockquote>
<p>Translating that into our example, we can say that when N is large enough, the distribution of s is approximately normal even though p(x) is not normally distributed.</p>
<p>Since the Monte Carlo method yields an unbiased estimator of the expectation of a distribution, the expectation and variance of the estimator are computed as follows:</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-4.jpg" alt="equation 4" /></p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-5.jpg" alt="equation 5" /></p>
<h1 id="importance-samplingis">Importance Sampling (IS)</h1>
<h2 id="how-does-iswork">How does IS work?</h2>
<p>Importance sampling introduces a new distribution q(x), allowing us to get a potentially better estimate of the expectation of f(x). Specifically, we can rewrite the expectation formula as follows:</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-6.jpg" alt="equation 6" /></p>
<p>The equation goes through the following steps:</p>
<ol>
<li>The integral is multiplied by q(x)/q(x), which is equal to 1 and doesn’t break anything. The only requirement is that we need to have q(x)>0 whenever p(x) is nonzero. This is because if q(x) is zero when p(x) is nonzero, the equation will not hold as some values of x will be missing from the integral.</li>
<li>Then, we reorder the terms and group f(x)(p(x)/q(x)) as the new function of interest whose expectation we are computing. This allows us to sample from a different distribution q(x) and still get an estimation of the same value - i.e. the expectation of f(x) over p(x) - as before.</li>
</ol>
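<p>Continuing the toy example from above (again my own sketch, not from a particular source), the importance sampling estimator draws samples from q instead of p and reweights each one by p(x)/q(x):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Same toy target: x ~ p = N(0, 1) and f(x) = x^2, so E_p[f(x)] = 1.
def f(x):
    return x ** 2

# Proposal q = N(0, 2), chosen so that q(x) is positive wherever p(x) is.
N = 10_000
x = rng.normal(0.0, 2.0, size=N)                             # sample from q, not p
weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # p(x) / q(x)
print(np.mean(f(x) * weights))                               # still estimates E_p[f(x)]
</code></pre>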
<h2 id="why-does-ishelp">Why does IS help?</h2>
<p>In simple terms, IS helps because it 1) yields an unbiased estimator and 2) can help lower the variance of the estimator.</p>
<p>Say we use IS and estimate the expectation below:</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-7.jpg" alt="equation 7" /></p>
<p>We can compute the expectation and variance of the new estimator r as follows:</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-8.jpg" alt="equation 8" /></p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-9.jpg" alt="equation 9" /></p>
<p>We see that the new estimator r is still an unbiased estimator for the expectation of f(x) since the expectation of r is equal to the value being estimated.</p>
<p>Now that we have an unbiased estimator, let’s look at the variance. In reality, we might not be able to run the Monte Carlo method a large number of times due to cost constraints. Hence, a lower variance for the estimator indicates better quality. We can pick q(x) such that it results in lower variance; that is,</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-10.jpg" alt="equation 10" /></p>
<p>To achieve this, we want q(x) to be large where |p(x)f(x)| is large.</p>
<p>To understand why such a q(x) helps reduce the variance, consider a case where f(x) is very uneven while p(x) is a uniform distribution. Also assume that we have a very limited budget, so it’s infeasible for us to sample a large number of examples or run many Monte Carlo trials.</p>
<p>As we see in the left picture, f(x) has a very small region that contributes the most to the expectation. When we cannot sample a large amount from p(x), it’s likely that we miss that small “important” region of f(x), resulting in a bad estimation.</p>
<figure>
<img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/visualization.jpg" />
<figcaption style="text-align: center; color: #808080;"><em>Illustration of the effect of importance sampling.</em></figcaption>
</figure>
<p>If we pick q(x) so that it samples the “important” region with higher probability, the ratio p(x)/q(x) that we multiply f(x) by effectively down-weights the “important” region that we put extra probability on. This reweighting ratio is what keeps the IS estimator unbiased while lowering the variance.</p>
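<p>A quick numerical check of this intuition (my own toy setup, separate from the worked example in the next section): p is uniform on [0, 1], f is nonzero only on a narrow region, and a q that puts extra mass on that region gives a visibly lower-variance estimator.</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)

# Toy setup: p is uniform on [0, 1]; f is nonzero only on [0.90, 0.95],
# so a small naive Monte Carlo sample often misses the important region.
def f(x):
    return np.where((x >= 0.90) & (x <= 0.95), 100.0, 0.0)   # E_p[f(x)] = 5.0

def naive_estimate(n):
    x = rng.uniform(0.0, 1.0, size=n)                 # sample from p
    return f(x).mean()

def is_estimate(n):
    # q: a mixture putting half its mass on [0.90, 0.95], half spread uniformly.
    narrow = rng.random(n) < 0.5
    x = np.where(narrow, rng.uniform(0.90, 0.95, size=n),
                 rng.uniform(0.00, 1.00, size=n))
    q = np.where((x >= 0.90) & (x <= 0.95), 0.5 * 20.0 + 0.5, 0.5)  # q(x)
    return np.mean(f(x) / q)                          # weight = p(x)/q(x), with p(x) = 1

trials = 1_000
naive = np.array([naive_estimate(50) for _ in range(trials)])
weighted = np.array([is_estimate(50) for _ in range(trials)])
print("true value 5.0 | naive variance", naive.var(), "| IS variance", weighted.var())
</code></pre>
<p>With only 50 samples per trial, the naive estimator’s variance comes out far larger than the importance sampling estimator’s, because many naive trials draw few or no points from the important region.</p>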
<h2 id="example">Example</h2>
<p>To see how IS reduces variance, let’s look at an example with a discrete random variable X that can take on integer values in range [1,5]. Say p(X) is uniform over all values of X, so we have p(x) = 1/5.</p>
<p>Let’s define the function of interest f(x) as follows:</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-11.jpg" alt="equation 11" /></p>
<p>We can compute the expectation and variance of the Monte Carlo estimator over p(x) as follows:</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-12.jpg" alt="equation 12" /></p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-13.jpg" alt="equation 13" /></p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-14.jpg" alt="equation 14" /></p>
<p>Now say we have a new distribution q(x) defined as follows:</p>
<p><img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/equation-15.jpg" alt="equation 15" /></p>
<p>Now, we can compute the expectation and variance of our estimator as follows:</p>
<figure>
<img src="https://raw.githubusercontent.com/jesscel/jesscel.github.io/master/assets/posts/2023-10-28-importance-sampling-explained/example.jpg" />
<figcaption style="text-align: center; color: #808080;"><em>p(x) and q(x).</em></figcaption>
</figure>
<p>We see that the variance given by the IS estimator is much lower than the one given by the uniform estimator.</p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://www.youtube.com/watch?v=C3p2wI4RAi8&t=437s">📺 Importance Sampling by Mutual Information</a></li>
<li><a href="https://www.statlect.com/asymptotic-theory/importance-sampling">📝 Importance sampling by Marco Taboga</a></li>
</ul>
<hr />
<p>I hope you enjoyed this article! Connect with me on LinkedIn if you are also interested in AI, ML, databases, and more.</p>