Data and Model Centric Approaches for Expansion of Large Language Models to New Languages - Tutorial @ EMNLP 2025

Anoop Kunchukuttan1,3, Raj Dabre1,2,4, Rudra Murthy5, Mohammed Safi Ur Rahman Khan1,2, Thanmay Jayakumar1,2
1Nilekani Centre at AI4Bharat, 2Indian Institute of Technology Madras, 3Microsoft, 4Google DeepMind, 5IBM Research
Saturday, Nov 8, 2025, 14:00–17:30
Cover image generated by DALL-E 3

Abstract

Despite the rapid pace of Large Language Model (LLM) research, the vast majority of existing LLMs support mainly English and a handful of high-resource languages, leaving a major gap for most low-resource languages. In this tutorial, we focus on approaches to expand the language coverage of existing LLMs, which offers an efficient and viable path for bringing LLM technologies to low-resource languages without training from scratch. We examine adaptations at various stages of the LLM training pipeline, including tokenizer training, pre-training, instruction tuning, alignment, and evaluation, covering both data-oriented and model-oriented approaches. We hope that this tutorial enables researchers and practitioners to incorporate additional languages and tasks into existing LLMs, enhancing their inclusivity and coverage.
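As a small taste of the model-oriented adaptations covered in the tutorial, the sketch below shows vocabulary expansion, a common first step when adding a new language to an existing LLM: target-language tokens are added to the tokenizer and the embedding matrix is resized so they can be learned during continued pre-training. This is a minimal illustration assuming a Hugging Face causal LM; the base model and the example tokens are placeholders, not material from the tutorial.

```python
# Minimal sketch: expand an LLM's vocabulary for a new language.
# Assumptions: Hugging Face `transformers` is installed; "gpt2" stands in
# for any causal LM; the new tokens are illustrative Devanagari words.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical subword units mined from target-language corpora.
new_tokens = ["नमस्ते", "धन्यवाद"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding (and tied output) matrix so the new token ids get
# trainable rows; these are randomly initialized and would be learned
# during continued pre-training on target-language data.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```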

More details coming soon

Organizers

Anoop Kunchukuttan

Principal Applied Researcher, Microsoft

Raj Dabre

Research Scientist, Google DeepMind

Rudra Murthy

Research Scientist, IBM Research

Mohammed Safi Ur Rahman Khan

PhD Student, AI4Bharat (IIT Madras)

Thanmay Jayakumar

Master's Student, AI4Bharat (IIT Madras)