
Scaling Trends

Datastore Scaling

Previous work has shown that scaling the datastore helps language modeling, but it remained unclear whether the benefit carries over to downstream tasks. We show that MassiveDS improves not only language modeling but also downstream tasks such as MMLU.

Datastore Scaling overview.
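As a concrete illustration, a datastore-scaling curve can be traced by subsampling the datastore at increasing fractions and re-running retrieval-augmented evaluation at each size. The sketch below is a minimal illustration rather than our released pipeline; eval_fn is a caller-supplied placeholder for whatever index-building and evaluation stack is in use.

import random

def datastore_scaling_curve(corpus, eval_fn, fractions=(0.01, 0.1, 0.5, 1.0), seed=0):
    """Trace performance as the retrieval datastore grows.

    `corpus` is a list of passages; `eval_fn(subsample)` is a caller-supplied
    hook that builds an index over the subsample and returns a metric
    (e.g., MMLU accuracy or perplexity with retrieval). Both are placeholders
    for whatever retrieval/evaluation stack is in use.
    """
    rng = random.Random(seed)
    curve = []
    for frac in fractions:
        # Subsample the datastore to the target size and re-evaluate.
        subsample = rng.sample(corpus, int(len(corpus) * frac))
        curve.append((len(subsample), eval_fn(subsample)))
    return curve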

Compute-Optimal Scaling

We find that, for the same training compute, using a larger datastore can significantly improve performance. Figure 4 shows compute-optimal scaling curves for OLMo and Pythia on four downstream tasks.

Compute-Optimal Scaling overview.
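Concretely, a compute-optimal curve can be read as the upper envelope of all (training compute, performance) points pooled across model sizes and datastore sizes. The sketch below computes that envelope on made-up numbers; it illustrates the idea rather than our exact procedure.

def compute_optimal_frontier(runs):
    """Given runs as (train_flops, performance) tuples, return the points on
    the compute-optimal frontier: the best performance achievable at or below
    each compute budget (higher performance is better).
    """
    frontier = []
    best = float("-inf")
    for flops, perf in sorted(runs, key=lambda r: r[0]):
        if perf > best:
            best = perf
            frontier.append((flops, perf))
    return frontier

# Illustrative numbers only: runs pooled from different model/datastore sizes.
runs = [(1e19, 0.42), (3e19, 0.47), (3e19, 0.45), (1e20, 0.53)]
print(compute_optimal_frontier(runs))  # [(1e+19, 0.42), (3e+19, 0.47), (1e+20, 0.53)]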

Analysis

Retriever is Robust to Out-of-Distribution Data in the Datastore

We compare the performance of MassiveDS with single-domain datastores in Table 3. The results show that MassiveDS matches or outperforms all single-domain datastores.

Single Domain overview.

We further investigate the source of this robustness and find that the retriever tends to retrieve more from the relevant domain despite the presence of a large amount of out-of-distribution data, as shown in Figure 5.

Retrieval Distribution overview.
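One way to measure this, in the spirit of Figure 5, is to count which datastore domain each top-k retrieved document comes from and normalize. The sketch below assumes a caller-supplied retrieve_fn and documents tagged with a domain field; both names are illustrative.

from collections import Counter

def retrieval_domain_distribution(queries, retrieve_fn, k=10):
    """Fraction of top-k retrieved documents drawn from each datastore domain.

    `retrieve_fn(query, k)` is a caller-supplied nearest-neighbor search that
    returns documents tagged with the domain they came from, e.g.
    {"text": ..., "domain": "pes2o"}; the field name is illustrative.
    """
    counts = Counter()
    total = 0
    for query in queries:
        for doc in retrieve_fn(query, k):
            counts[doc["domain"]] += 1
            total += 1
    return {domain: n / total for domain, n in counts.items()}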

Overall, these results show that retrieving from broad datastores like MassiveDS can simultaneously improve performance across multiple domains, paving the way toward general-purpose retrieval-based models.

Impact of Reranker and Data Contamination

As shown in Figure 6, we find that scaling trends can be further improved with more advanced retrieval, such as reranking. In addition, data contamination has a large impact on language modeling evaluation, as shown in Figure 7; we therefore advocate applying strict decontamination for RAG perplexity evaluation.

Analysis overview.
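A common decontamination recipe, sketched below, is to drop any retrieved passage that shares a long word-level n-gram with the evaluation text. The 13-gram threshold is illustrative rather than our exact setting.

def ngrams(text, n=13):
    """Set of word-level n-grams, a standard unit for contamination checks."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(retrieved_passages, eval_text, n=13):
    """Drop retrieved passages that share any n-gram with the eval text.

    Stricter setups may also filter on a partial-overlap ratio rather than
    on any single matching n-gram.
    """
    eval_grams = ngrams(eval_text, n)
    return [p for p in retrieved_passages if not (ngrams(p, n) & eval_grams)]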

Future Directions

Our compute-optimal scaling trends indicate that retrieval-based language models (LMs) scale better than standalone LMs. Based on these findings, future research could focus on designing more effective training resource allocation strategies for retrieval-based LMs.

We demonstrate that a trillion-token datastore can effectively enhance language modeling and several downstream tasks. This introduces new challenges in serving retrieval-based LMs and highlights the need for efficient serving of large-scale datastores coupled with LMs. Future work could use MassiveDS to test new, efficient index structures and systems.
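For a sense of what such serving involves, the sketch below builds a compressed approximate nearest-neighbor index with FAISS (IVF with product quantization), trading a little recall for much lower memory and latency. The dimensions and cluster counts are illustrative and not the configuration used for MassiveDS.

import numpy as np
import faiss

d = 768                        # embedding dimension (illustrative)
nlist, m, nbits = 1024, 64, 8  # IVF clusters, PQ sub-vectors, bits per code

# Compressed IVF-PQ index: vectors are quantized, so memory grows far more
# slowly than an exact flat index would over trillions of tokens.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.rand(100_000, d).astype("float32")  # stand-in passage embeddings
index.train(xb)
index.add(xb)

index.nprobe = 32              # clusters probed per query: recall/latency knob
xq = np.random.rand(1, d).astype("float32")        # stand-in query embedding
distances, ids = index.search(xq, 10)              # top-10 approximate neighbors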

There has been ongoing discussion comparing retrieval-based LMs to long-context LMs. Our work relates to this discussion in that we retrieve from trillions of tokens and prepend the results to the context, which can be viewed as achieving a trillion-token context length through a sparse context selection process. Conversely, a long-context LM allows more retrieved tokens to be included in the context. We are curious about potential follow-ups in this direction.
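Under this view, retrieval acts as a sparse selector: only the top-ranked passages that fit the model's window are actually prepended. The sketch below packs retrieved passages into a fixed token budget; retrieve_fn and the tokenizer are caller-supplied placeholders, not part of our released code.

def build_rag_prompt(query, retrieve_fn, tokenizer, max_context_tokens=4096, k=100):
    """Prepend as many top-ranked retrieved passages as fit in the window.

    `retrieve_fn(query, k)` is a caller-supplied search over the datastore that
    returns passages in descending relevance order; `tokenizer.encode` is any
    tokenizer with that method. Only a sparse slice of the trillion-token
    datastore ever enters the model's context.
    """
    budget = max_context_tokens - len(tokenizer.encode(query))
    selected = []
    for passage in retrieve_fn(query, k):
        cost = len(tokenizer.encode(passage))
        if cost > budget:
            break  # stop once the next passage no longer fits
        selected.append(passage)
        budget -= cost
    return "\n\n".join(selected + [query])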

Reproduction and Datastore

All code is available on GitHub, and all datastore artifacts are available on Hugging Face.

BibTeX

@article{shao2024scaling,
  title={Scaling Retrieval-Based Language Models with a Trillion-Token Datastore},
  author={Shao, Rulin and He, Jacqueline and Asai, Akari and Shi, Weijia and Dettmers, Tim and Min, Sewon and Zettlemoyer, Luke and Koh, Pang Wei},
  journal={arXiv preprint arXiv:2407.12854},
  year={2024}
}