Data and Model Centric Approaches for Expansion of Large Language Models to New Languages - Tutorial @ EMNLP 2025

Anoop Kunchukuttan1,3, Raj Dabre1,2,4, Rudra Murthy5, Mohammed Safi Ur Rahman Khan1,2, Thanmay Jayakumar1,2
1Nilekani Centre at AI4Bharat, 2Indian Institute of Technology Madras, 3Microsoft, 4Google DeepMind, 5IBM Research
Saturday, Nov 8, 2025, 14:00–17:30
Cover image generated by DALL-E 3

Abstract

Despite the rapid pace of Large Language Model (LLM) research, the vast majority of existing LLMs support mainly English and a handful of high-resource languages, leaving a major gap for most low-resource languages. In this tutorial, we focus on approaches to expand the language coverage of existing LLMs, which offers an efficient and viable path for bringing LLM technologies to low-resource languages without training from scratch. We examine adaptations at various stages of the LLM training pipeline, including tokenizer training, pre-training, instruction tuning, alignment, and evaluation, covering both data-oriented and model-oriented approaches. We hope that this tutorial enables researchers and practitioners to incorporate additional languages and tasks into existing LLMs, enhancing their inclusivity and coverage.
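As a small taste of the model-oriented adaptations covered in the tutorial, the sketch below shows vocabulary expansion, a common first step when adding a new language to an existing LLM: target-language tokens are added to the tokenizer and the embedding matrix is resized so they can be learned during continued pre-training. This is a minimal illustration assuming a Hugging Face causal LM; the base model and the example tokens are placeholders, not material from the tutorial.

```python
# Minimal sketch: expand an LLM's vocabulary for a new language.
# Assumptions: Hugging Face `transformers` is installed; "gpt2" stands in
# for any causal LM; the new tokens are illustrative Devanagari words.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical subword units mined from target-language corpora.
new_tokens = ["नमस्ते", "धन्यवाद"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding (and tied output) matrix so the new token ids get
# trainable rows; these are randomly initialized and would be learned
# during continued pre-training on target-language data.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```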

More details coming soon

Organizers

Anoop Kunchukuttan

Principal Applied Researcher, Microsoft

Raj Dabre

Research Scientist, Google DeepMind

Rudra Murthy

Research Scientist, IBM Research

Mohammed Safi Ur Rahman Khan

PhD Student, AI4Bharat (IIT Madras)

Thanmay Jayakumar

Master's Student, AI4Bharat (IIT Madras)