Paris NLP saison 9 Meetup #3

👥 Johan Leduc – Senior ML Engineer @GitGuardian
➡️ Uncovering Critical Secrets with LLMs
Summary: GitGuardian detects secrets—like passwords and API keys—in code, but the sheer volume can overwhelm users. Sifting through them to find the most critical ones is like searching for a needle in a haystack. In this talk, we’ll dive into how we leveraged LLMs at key stages to prioritize secrets efficiently and at scale.
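
As a flavour of what "LLMs at key stages" can look like, here is a minimal, purely illustrative sketch that asks an LLM to score a detected secret's criticality. The prompt, rating scale, and model are assumptions for the example; GitGuardian's actual pipeline is not shown here.

```python
# Illustrative only: scoring a detected secret's criticality with an LLM.
# Assumes an OpenAI-compatible client; prompt and scale are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def criticality_score(secret_context: str) -> int:
    """Ask the model for a 1-5 criticality score for a leaked secret."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You rate leaked secrets. Answer with a single digit "
                        "from 1 (harmless test credential) to 5 (live production key)."},
            {"role": "user", "content": secret_context},
        ],
    )
    return int(response.choices[0].message.content.strip())

score = criticality_score("AWS key found in deploy.sh on the main branch of a public repo")
print(score)
```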

👥 Louis Leconte – ML Research Engineer @Pruna AI

➡️ How to quantize an LLM in 3 lines of code
Summary: Large Language Models (LLMs) are powerful but often computationally expensive, making deployment challenging. In this talk, we’ll explore how to quantize an LLM in just three lines of code using Pruna AI’s frictionless solution. I’ll introduce our data-free vector quantization approach, which optimizes CUDA kernels to enable efficient inference, all in under five minutes. Whether you’re working on edge AI, server-side deployments, or simply curious about making LLMs more efficient, this session will give you a hands-on glimpse into state-of-the-art quantization techniques.
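
Pruna AI's own API is not reproduced here; as a generic stand-in showing what three-line quantization looks like in practice, here is a sketch using Hugging Face transformers with bitsandbytes 4-bit loading (requires a CUDA GPU and the bitsandbytes package):

```python
# Generic three-line quantization via transformers + bitsandbytes,
# shown as a stand-in; Pruna AI's own API differs.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=config
)
```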

Paris NLP saison 9 Meetup #2

Alexandre Brasseur – ML Engineer

Lost in the feed: how to ask questions of billions of social media posts
Summary: Everyone knows what “RAG” means, but is it really that simple? Let’s embark on our journey of building a Conversational Search Engine from scratch, designed to deliver quick insights from billions of social media posts and news articles ingested daily. In this talk we will share the initiatives and failures we went through to build a production-grade RAG system.
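
For readers new to the pattern, a minimal sketch of the retrieve-then-generate loop at the heart of such a system follows; the corpus and model name are toy choices, and a production system adds reranking, filtering, caching, and evaluation on top.

```python
# Minimal RAG sketch: dense retrieval over a toy corpus, then prompt assembly.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Brand X launched a new sneaker line in March.",
    "Users report battery issues with phone model Y.",
]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

query = "What are people saying about phone model Y?"
hits = util.semantic_search(encoder.encode(query, convert_to_tensor=True),
                            corpus_emb, top_k=1)[0]
context = "\n".join(corpus[h["corpus_id"]] for h in hits)

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` is then sent to the LLM of your choice.
print(prompt)
```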

Léo Hunout & Kamel Guerda – AI Research Engineers @IDRIS

Jean Zay: Unlocking the Power of Supercomputing for Research
Summary: In this talk, we’ll introduce Jean Zay, the French supercomputer hosted at IDRIS, and explain how it can empower the NLP community. We will cover how researchers and developers working on open-source projects can access and leverage this powerful resource, as well as key strategies for optimizing the training and fine-tuning of large language models (LLMs) on supercomputers. Whether you’re new to supercomputing or exploring its possibilities for NLP, this session aims to demystify the process and equip you with the foundational knowledge and tools to get started.
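
As a taste of the kind of setup involved, here is a hedged sketch of the standard PyTorch multi-GPU pattern used on SLURM-based clusters such as Jean Zay; it is launched with torchrun or srun, and details vary by machine and job configuration.

```python
# Hedged sketch: PyTorch DistributedDataParallel on a multi-GPU cluster node.
# Launched with `torchrun --nproc_per_node=N train.py` or via srun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")     # NCCL for GPU-to-GPU communication
local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for your LLM
model = DDP(model, device_ids=[local_rank])
# ... training loop as usual; gradients are synchronized across GPUs ...
dist.destroy_process_group()
```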

Paris NLP saison 9 Meetup #1

Alexandre Défossez – Chief Exploration Officer @Kyutai

Moshi: a speech-text foundation model for real-time dialogue.
Summary: We will discuss Moshi, our recently released model. Moshi is capable of full-duplex dialogue, i.e. it can both speak and listen at any time, offering the most natural speech interaction to date. Moshi is also multimodal: in particular, it is able to leverage its inner text monologue to improve the quality of its generation. We will cover the design choices behind Moshi, in particular the efficient joint sequence modeling enabled by the RQ-Transformer, and the use of large-scale synthetic instruction data.

Louis Lacombe, Valentin Laurent, Thibault Cordier – Data Scientists @ Quantmetry – Part of Capgemini Invent

Enhancing NLP Model Reliability with MAPIE: Conformal Prediction for Uncertainty Quantification
Summary: This talk introduces MAPIE, an open-source Python library designed to quantify uncertainties and control risks in machine learning models, with a focus on NLP applications. We will begin by discussing the importance of uncertainty quantification and the conformal prediction framework, which provides guarantees under few assumptions. Then, we will present MAPIE, showcasing how to compute conformal prediction sets for NLP tasks like text classification. Finally, we will explore practical use cases, highlighting the capabilities of MAPIE and providing attendees with a comprehensive overview of its potential applications.
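
For a concrete feel, here is a minimal sketch of conformal prediction sets for text classification with MAPIE's MapieClassifier; the data and base model are toy choices for illustration.

```python
# Sketch: conformal prediction sets for text classification with MAPIE.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from mapie.classification import MapieClassifier

# Toy data; a real application uses a proper train/calibration split
# with far more examples.
X_train = ["great movie", "terrible plot", "loved it", "awful acting"]
y_train = [1, 0, 1, 0]
X_calib = ["really enjoyable", "boring and bad", "fantastic cast", "dreadful pacing"]
y_calib = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(X_train, y_train)

mapie = MapieClassifier(estimator=clf, cv="prefit")
mapie.fit(X_calib, y_calib)  # calibration step on held-out data

# Prediction sets targeting 90% coverage (alpha = 0.1)
y_pred, y_sets = mapie.predict(["a decent film"], alpha=0.1)
```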

Paris NLP Season 8 Meetup #4

Publicis Sapient, Romain Benassi, Consultant Data Scientist

Title: Extraction of medicine characteristics from official documents
Summary: A few years ago, a project to set up a Data Science platform was carried out, enabling the industrialization of NLP models for extracting information from medical texts. The goal was to develop strategic services such as giving access to relevant data, searching for medicine information, or securing prescriptions. This talk will focus on the constraints and technical issues surrounding the platform, as well as the different tools constituting it and the algorithms designed. It will conclude with a discussion of how the emergence of GenAI might change the way such a project would be handled today.

Algolia, Paul-Louis Nech, Research Machine Learning Engineer

Title: Where’s the beef? Evaluating the quality of content in your GenAI Project

Summary: So you created a GenAI project based on your company’s content and some LLMs. Is the generated content any good?
From eyeballing it during development to state-of-the-art user behavior tracking, there are many ways you can approach this.
In this talk, Paul-Louis will present a GenAI feature currently in private beta at Algolia;
how they approached generating content for a diverse set of customers across the globe;
and a few techniques you can try for qualitative and quantitative assessments of your own generative projects!
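
One simple quantitative assessment in this spirit, sketched below: scoring generated answers against trusted reference content with sentence embeddings. The model name is illustrative, and this is not Algolia's internal evaluation stack.

```python
# Sketch: flag generated content that drifts from trusted reference content
# by measuring semantic similarity with sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
generated = "Our plan supports up to 10 seats per workspace."
reference = "Each workspace can contain a maximum of 10 seats."

sim = util.cos_sim(model.encode(generated), model.encode(reference)).item()
print(f"semantic similarity: {sim:.2f}")  # route low-similarity outputs to review
```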

Paris NLP Season 8 Meetup #3

Sergei Bogdanov – Data Scientist – NuMind

Title: NuNER & NuSentiment – Creating efficient Foundation Models thanks to LLMs
Summary: How do you create small, data-efficient foundation models that are on par with 7B LLMs? In this talk, we will discuss how we created NuNER & NuSentiment, 100M-parameter foundation models that outperform existing similar-sized models in few-shot classification and entity recognition.
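
The general recipe behind such compact models, a frozen encoder plus a small task head trained on a handful of examples, can be sketched as follows; the backbone shown is a generic stand-in, not the actual NuNER checkpoint.

```python
# Sketch: token embeddings from a compact frozen encoder, on top of which a
# small head is fitted few-shot. "roberta-base" is a stand-in backbone.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("roberta-base")
enc = AutoModel.from_pretrained("roberta-base")

inputs = tok("Acme Corp hired Jane Doe in Paris.", return_tensors="pt")
with torch.no_grad():
    token_emb = enc(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
print(token_emb.shape)

# Few-shot recipe: keep the encoder frozen and fit a small head (for example
# sklearn's LogisticRegression) on these token embeddings, using a handful of
# labeled tokens per entity type.
```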

_______________

Raphaël Bournhonesque – Machine Learning Engineer – Open Food Facts


Title: Extracting ingredients from photos of food packaging: from LLM-augmented annotation to production
Summary: Raphaël from Open Food Facts will present their latest machine learning project: the automatic extraction of ingredient lists from photos of food packaging. He will share his experience of using LLMs to pre-annotate data and explain how the model was integrated into production.
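
A minimal sketch of what LLM pre-annotation can look like: prompting a model to draft ingredient lists from OCR text, for human annotators to review. The prompt and model are assumptions for the example, not Open Food Facts' actual setup.

```python
# Illustrative pre-annotation sketch with an OpenAI-compatible client.
from openai import OpenAI

client = OpenAI()

ocr_text = "INGREDIENTS: wheat flour, sugar, palm oil, cocoa 7%, salt. May contain nuts."
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Extract the ingredient list from this packaging text, "
                   f"one ingredient per line:\n{ocr_text}",
    }],
)
draft_annotation = response.choices[0].message.content
print(draft_annotation)  # a human annotator validates or corrects this draft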

Paris NLP Season 8 Meetup #2

Thomas Scialom – Research Scientist (LLMs) – Meta AI


Title: LLMs: past, present and future
Summary: Thomas will present a brief history of LLMs, from GPTs to the latest frontier models, before diving into the science behind RLHF, the technology powering ChatGPT and Llama-2. Finally, he will present his perspective on what could be the future of the field.
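
For reference, the standard RLHF objective from the literature (as in InstructGPT-style training) maximizes a learned reward while penalizing divergence from the supervised reference policy; the exact formulation presented in the talk may differ.

```latex
% Standard RLHF objective: maximize the reward model's score while
% staying close (in KL) to the supervised fine-tuned reference policy.
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\left[ r_\phi(x, y) \right]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
```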

_______________
Francois Role – Université Paris Cité


Title: Aligning Text and Image Representations Using Vision-Language Pretrained Models
Summary: Data related to the same topic often comes in many modalities (audio, image, text, etc.). The ability to bridge the gap between these modalities is what makes applications such as multimodal information retrieval, multimodal classification, and automatic image captioning possible. In this talk, we will present Vision-Language Pretrained models (VLP models) that have been designed to jointly encode vision and language, with a focus on bidirectional contrastive learning. We will first explain the loss function used in this context, starting from an intuitive example before presenting its formal definition. We will then show that training methods based on this kind of loss, while very useful, do not necessarily lead to an optimal alignment of the different modalities. We will therefore present a method for improving the quality of the text and image representations produced by VLP models.
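
To make the loss concrete, here is a minimal sketch of the symmetric (bidirectional) contrastive loss used by CLIP-style VLP models: matching image-text pairs are pulled together while all other pairs in the batch serve as negatives.

```python
# Minimal sketch of the bidirectional contrastive loss of CLIP-style models.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # pairwise similarities
    targets = torch.arange(len(logits))            # i-th image matches i-th text
    # Cross-entropy in both directions: image->text and text->image
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))  # toy batch of 8 pairs
```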

Paris NLP Season 8 Meetup #1

Florent Gbelidji – Hugging Face
Title: Customizing RAG System Components to Build a Domain-Specific Assistant


Summary: Retrieval-Augmented Generation (RAG) has become a prevalent approach in developing Large Language Model (LLM) applications, incorporating industry-specific data and the most recent information. In this session, we’ll delve into the mechanisms of RAG applications, focusing on key components like the retriever and the LLM. Our exploration will include leveraging tools from the open-source ecosystem to fine-tune these components, enhancing their performance in providing assistance, especially when confronted with domain-specific questions.
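
As one illustration of fine-tuning the retriever component, here is a hedged sketch using sentence-transformers with in-batch negatives; the (question, passage) pairs and model name are invented for the example.

```python
# Sketch: fine-tune a retriever on domain-specific (question, passage) pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [
    InputExample(texts=["What is our refund window?",
                        "Refunds are accepted within 30 days of purchase."]),
    InputExample(texts=["How do I rotate an API key?",
                        "API keys can be rotated from the security settings page."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```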

_______________

Guillaume Richard and Marie Lopez – InstaDeep
Title: The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics


Summary: Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet the prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between prediction tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named the Nucleotide Transformer, ranging from 50M up to 2.5B parameters and integrating information from 3,202 diverse human genomes, as well as 850 genomes selected across diverse phyla, including both model and non-model organisms. These transformer models yield transferable, context-specific representations of nucleotide sequences, which allow for accurate molecular phenotype prediction even in low-data settings. We show that the developed models can be fine-tuned at low cost, even in low-data regimes, to solve a variety of genomics applications.
Despite no supervision, the transformer models learned to focus attention on key genomic elements, including those that regulate gene expression, such as enhancers. Lastly, we demonstrate that utilizing model representations can improve the prioritization of functional genetic variants. The training and application of foundation models in genomics explored in this study provide a widely applicable stepping stone to bridge the gap of accurate molecular phenotype prediction from DNA sequences.
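
For readers who want to try the models, a hedged sketch of extracting sequence representations via the Hugging Face hub; the checkpoint name is assumed from InstaDeep's public releases, so check the model card for exact loading instructions.

```python
# Sketch: sequence-level embeddings from a Nucleotide Transformer checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "InstaDeepAI/nucleotide-transformer-500m-human-ref"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

seq = "ATGGCGTACGATCGATCGATCGGCTA"
inputs = tok(seq, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
embedding = hidden.mean(dim=1)  # pooled representation for downstream tasks
print(embedding.shape)
```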

Paris NLP Season 7 Meetup #3

Perceval Wajsburt – APHP

Multilingual normalization and structured entity composition in medical documents

Summary : Hospital clinical documents are a true treasure trove of information for various applications, from clinical research to epidemiological surveillance, medical coding, and decision support. However, their use for large-scale computer processing is hampered by the fact that they are predominantly written in natural language, requiring prior structuring. In this seminar, we will primarily focus on two tasks to structure these documents: entity normalization and structured entity extraction. We propose a large-scale multilingual approach to normalize named entities in multiple languages. Then, we explain how to compose simple entities into structured entities, using a novel method based on mention cliques and scope relations. Our evaluation is based on a new annotated corpus of clinical breast radiography reports. We will also discuss the challenges associated with applying deep learning in conditions of limited data, for languages other than English and in the clinical field.

Etienne Bernard – NuMind

Creating NLP Models in the Age of LLMs

Summary : Large Language Models (LLMs) have the potential to radically transform the way we tackle NLP applications, but it is still unclear how to use them best. We recently developed NuMind, a tool that leverages LLMs to efficiently create NLP models (e.g. classifiers and entity detectors) through a paradigm inspired by how humans teach each other, which we call Interactive AI Development. In this talk, I will present this paradigm, demonstrate the tool that we developed around it, and talk about the scientific & technological solutions used under the hood.

Paris NLP Season 7 Meetup #2

Nils Holzenberger, Associate Professor, Télécom Paris, Institut Polytechnique de Paris

Computational Statutory Reasoning

Summary : Statutory reasoning is the task of determining how laws apply to a legal case. This is a basic skill for lawyers, and in its computational form, a fundamental task for legal artificial intelligence systems. In this talk, I describe initial steps towards solving computational statutory reasoning. First, I define this task in the context of legal practice, and artificial intelligence more broadly. Second, I introduce the StAtutory Reasoning Assessment benchmark dataset (SARA). With the ability to measure performance on statutory reasoning, I show how a symbolic system can solve the task, while state-of-the-art machine reading struggles. Third, I connect statutory reasoning to established natural language processing tasks, in an attempt to diagnose machine reading errors. This yields more annotations on SARA and a performance boost compared to initial baselines, and opens up statutory reasoning to the general NLP community.

Marine Vinyes, Machine Learning Engineer Lead @ Criteo

How can CLIP models help you leverage image and text data?

Summary: Two years ago, OpenAI released CLIP, a model that efficiently embeds image and text representations in a shared space. Since then, many open-source variants have surfaced. In this talk I will show how to use it to leverage your image and text data. As a use case, I will explain how it is used at Criteo on billions of images and texts from catalog data.
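
As a minimal illustration of the kind of usage discussed, here is zero-shot image-text matching with the openly released CLIP checkpoint via Hugging Face transformers; the image path and captions are placeholders.

```python
# Sketch: zero-shot image-text matching with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product.jpg")  # e.g. a catalog image
texts = ["a red sneaker", "a leather handbag", "a winter coat"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity over captions
print(dict(zip(texts, probs[0].tolist())))
```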

Paris NLP Season 7 Meetup #1

Patrick Paroubek, LISN – Laboratoire Interdisciplinaire des Sciences du Numérique, CNRS-Université Paris-Saclay

Information extraction and Aspect-Based Sentiment Analysis (ABSA)

Patrick Paroubek of the LISN laboratory (CNRS-Université Paris-Saclay) will share feedback on current research work at the frontier between Natural Language Processing (NLP), Economics, and Finance. He will focus in particular on questions related to information extraction and Aspect-Based Sentiment Analysis (ABSA), in the context of two ongoing PhD theses. This will be an opportunity to see what resources exist for the French language and what progress recent advances in deep learning suggest.

Wissal El Achouri, Algoan

Credit decisioning: enriching Open Banking data using NLP

Since January 2018, an EU directive called PSD2 has required banks to enable access to their data in a secure and standardised way, so that it can be more easily shared by customers. This directive enabled Open Banking in Europe.

Open Banking data is a revolution for many financial services, and especially for lending. Algoan leverages Open Banking to offer a credit decisioning API that helps financial institutions make sound credit decisions for consumer loans.

In this talk, we focus on one of the enrichments that Algoan has been able to bring to Open Banking data: the categorisation of transactions. By categorisation, we refer to the process that associates a bank transaction with a category. A category describes the reason why a transaction was executed.
Categorising transactions is a necessary step for making automatic and accurate credit decisions.

This is an NLP task; however, it differs from most other NLP tasks in that the text attached to a transaction is not structured like natural spoken language. Moreover, there are many challenges:
– the selection and labelling of a high volume of data,
– the design of a highly performant categorisation engine that covers as many transactions as possible,
– the development of an efficient maintenance system to preserve a high level of precision in production,
– and ensuring that the entire pipeline (labelling/training/deploying/monitoring) scales internationally to foreign languages.

During this talk, we will explain the process we adopted to overcome these challenges and arrive at a performant, well-monitored and scalable categorisation engine.
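
As a toy illustration of a categorisation baseline (not Algoan's actual engine), character n-grams cope well with text that is not natural spoken language, such as bank transaction labels; the labels and categories below are invented.

```python
# Toy transaction categorisation baseline: char n-grams + linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labels = ["CARD 12/03 CARREFOUR PARIS", "VIR SEPA SALAIRE ACME",
          "PRLV EDF FACTURE 0342", "CARD 14/03 SNCF INTERNET"]
categories = ["groceries", "salary", "utilities", "transport"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # robust to codes/typos
    LogisticRegression(),
)
clf.fit(labels, categories)
print(clf.predict(["PRLV EDF 0398"]))  # -> likely "utilities"
```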