Airavata: Introducing Hindi Instruction-tuned LLM

Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain, Aswanth Kumar, Mohammed Safi Ur Rahman Khan, Diptesh Kanojia, Ratish Puduppully, Mitesh Khapra, Raj Dabre, Rudra Murthy, Anoop Kunchukuttan
25 January 2024
Cover image generated by DALL-E 3

Examples of open-ended text generation by Hindi instruction-tuned Airavata model.

Introduction

The last year has witnessed tremendous interest and activity in the world of large language models (LLMs). LLMs hold the potential to unlock exciting applications in artificial intelligence, thanks to their ability to comprehend complex natural language instructions and excel in a broad spectrum of tasks involving language, knowledge, reasoning, and creative generation. To foster research, innovation, and widespread adoption, an open ecosystem is essential. We have observed significant advancements in this area with the launch of models like Llama 2 and Mistral, as well as their instruction-tuned variants such as Llama 2 Chat, Mistral-Instruct, and Zephyr, among others. Major progress has also been made in developing datasets for pre-training (e.g., RedPajama), instruction tuning (e.g., Alpaca, UltraChat, Dolly, OpenAssistant, LMSYS-Chat), and evaluation benchmarks (e.g., AlpacaEval, MT-Bench). However, most of these advancements have been predominantly centered on the English language.

There is some limited support for Indian languages, which can be attributed to the incidental inclusion of some Indian language data that slipped through the data filters during the pre-training of these language models. However, the representation of data, the efficacy of tokenizers, and task performance for Indian languages lag considerably behind English. Performance in Indian languages, even for closed-source models such as ChatGPT (OpenAI et al., 2020), GPT-4 (Achiam et al., 2023), and others, is inferior to that in English (Ahuja et al., 2023). Therefore, there is an urgent need to develop a similar ecosystem of tools, models, and datasets for Indian languages to foster research and innovation. In pursuit of this objective, we recently collaborated with Sarvam AI to launch OpenHathi (Sarvam et al., 2023), an open-source foundational model for Hindi, developed by extending Llama 2 (Touvron et al., 2023).

Today, we announce the next step - an initial release of "Airavata", an instruction-tuned model for Hindi built by finetuning OpenHathi (Sarvam et al., 2023) with diverse, instruction-tuning Hindi datasets to make it better suited for assistive tasks.

Along with the model, we also share the instruction tuning datasets to enable further research for IndicLLMs. We rely on human-curated, license-friendly instruction-tuning datasets to build "Airavata". We do not use data generated from proprietary models such as GPT-4. We think this is a more sustainable way of building instruction-tuned models at scale for most Indic languages, where relying on distilled data from commercial models would increase costs and restrict their free usage in downstream applications due to licensing restrictions.

We also compile a collection of evaluation benchmarks along with an evaluation framework to compare various LLMs for their abilities on diverse tasks when instructed in Hindi. Using this benchmark as well as human judgments, we compare different LLMs to quantify the current state of their Hindi capabilities. We conduct a detailed analysis of Airavata's performance on a variety of NLU and NLG tasks and find that instruction fine-tuning helps align the model to a variety of NLU tasks. There is significant potential for improvement in NLG tasks, which requires the creation of larger, more diverse instruction datasets as well as innovations in aligning English model representations to Hindi representations to drive better cross-lingual transfer.

Instruction Tuning Dataset Creation

High-quality instruction tuning datasets are important for the good performance of LLMs. However, there are few diverse datasets for Hindi. Following Wei et al. (2023), we rely on translating high-quality English supervised instruction-tuning datasets into Hindi. We use IndicTrans2 (Gala et al., 2023), the state-of-the-art open-source MT model for Indian languages, for translation. Some previous works (Li et al., 2023, Wei et al., 2023) have used ChatGPT (OpenAI et al., 2020) to translate instructions and/or generate responses in Hindi in order to make better use of context during translation (IndicTrans2 and most MT models are sentence-level). However, this is not cost-effective; the translation quality of ChatGPT is lower than that of IndicTrans2 (Gala et al., 2023), and its generation quality in Hindi might not be up to the mark (Ahuja et al., 2023). A future avenue of work would be improving translation quality when document context is available.

We sample examples from the different datasets listed in Table 1 to ensure balanced representation across all the tasks while fitting into our instruction tuning budget. We translate the instructions, inputs, and outputs into Hindi. This results in a total of 404k examples spanning English and Hindi. The translated Hindi examples were filtered to retain high-quality examples. Specifically, examples were retained only when the chrF++ score between the back-translated example and the corresponding English example was 50 or above. The final dataset used for instruction tuning contains 385k examples. Table 1 shows the details of the final training dataset. The dataset can be found on the 🤗 HuggingFace Hub.
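The round-trip filter described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: `approx_chrf_pp` is a simplified, dependency-free stand-in for chrF++ (it averages character 1-6-gram and word 1-2-gram F-scores with beta = 2, and will not reproduce the exact values of a reference implementation), and `back_translate` is a hypothetical callable standing in for the Hindi-to-English MT system. The field names `english` and `hindi` are also assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def word_ngrams(text, n):
    """Count word n-grams over whitespace-separated tokens."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def f_beta(hyp, ref, beta=2.0):
    """F-beta between two n-gram multisets; None if neither side has this order."""
    if not hyp and not ref:
        return None
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


def approx_chrf_pp(hypothesis, reference, char_order=6, word_order=2):
    """Simplified chrF++-style score on a 0-100 scale (approximation only)."""
    scores = []
    for n in range(1, char_order + 1):
        s = f_beta(char_ngrams(hypothesis, n), char_ngrams(reference, n))
        if s is not None:
            scores.append(s)
    for n in range(1, word_order + 1):
        s = f_beta(word_ngrams(hypothesis, n), word_ngrams(reference, n))
        if s is not None:
            scores.append(s)
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


def filter_by_round_trip(examples, back_translate, threshold=50.0):
    """Keep examples whose back-translation scores at or above the threshold."""
    kept = []
    for ex in examples:
        round_trip = back_translate(ex["hindi"])  # Hindi -> English (hypothetical MT call)
        if approx_chrf_pp(round_trip, ex["english"]) >= threshold:
            kept.append(ex)
    return kept
```

In practice one would likely swap `approx_chrf_pp` for an established chrF++ implementation so that the threshold of 50 is comparable with reported scores.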

We also create two native Hindi Instruction datasets:

  • wikiHow: wikiHow is an online wiki-style platform that hosts how-to articles spanning numerous topics. The articles on the platform are human-moderated, which ensures a high standard of quality. The questions posed by users in these articles closely align with potential use cases for this model, making it a rich resource for training. Additionally, this data might help induce reasoning capabilities and the generation of logical step-by-step responses. We curate around 20k and 6k articles in English and Hindi respectively, resulting in a total of around 26.5k articles. We currently formulate the data as a completion task given either the question alone or the question along with a few initial steps.
  • Anudesh: Anudesh is a crowd-sourced collection of prompts accompanied by responses generated from the Llama 2 70B model. Participants are provided with clear guidelines detailing the nature of the interaction required, including the specific language to be employed. These languages include Indic languages, English, transliterated Indic, as well as a blend of Indic and English in a code-mixed format. Contributors craft their prompts in adherence to these directives and the specified language criteria. These prompts are then paired with the corresponding translated outputs from the Llama 2 70B model. More details about the interactions will be released soon.
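As an illustration of the wikiHow completion formulation mentioned above, the sketch below turns an article into a prompt/target pair, where the prompt is either the question alone or the question plus the first few steps. The function name, the numbering style, and the newline-joined template are assumptions made for this example; the exact template used for training is not specified here.

```python
def make_completion_example(question, steps, num_given_steps=0):
    """Build a (prompt, target) pair from a how-to article.

    num_given_steps=0 gives the "question only" variant; a positive value
    gives the "question plus a few initial steps" variant.
    """
    if not 0 <= num_given_steps < len(steps):
        raise ValueError("num_given_steps must leave at least one step to predict")
    numbered = [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    prompt = "\n".join([question] + numbered[:num_given_steps])
    target = "\n".join(numbered[num_given_steps:])
    return prompt, target


# Illustrative article, not taken from the dataset.
article_steps = ["Boil water.", "Add tea leaves.", "Strain and serve."]
question_only = make_completion_example("How to brew tea?", article_steps)
with_two_steps = make_completion_example("How to brew tea?", article_steps, num_given_steps=2)
```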


Table 1: Instruction Finetuning Training Dataset Details

| Dataset | Description | Unfiltered (English) | Unfiltered (Hindi) | Filtered (English) | Filtered (Hindi) | License |
| --- | --- | --- | --- | --- | --- | --- |
| FLAN-v2 (Longpre et al., 2023) | A collection of NLP tasks that combines a number of existing NLP datasets with various data augmentations, introduced by Chung et al. (2022). We sample around 67K examples for our training mixture. | 67,463 | 67,463 | 67,463 | 65,228 | Apache-2.0 |
| Anthropic-HHH (Bai et al., 2022) | A collection of human-collected preference data for aligning the models to be helpful and harmless. We sample 5K conversations from the "chosen" column for our training mixture. | 5,000 | 5,000 | 5,000 | 4,911 | MIT |
| Dolly (Databricks et al., 2023) | A corpus of more than 15K records generated by thousands of Databricks employees to enable LLMs to exhibit the magical interactivity of ChatGPT. | 15,011 | 15,011 | 15,011 | 14,880 | CC-BY-SA-3.0 |
| OpenAssistant (Köpf et al., 2023) | A human-generated, human-annotated assistant-style conversation corpus consisting of 38K messages, resulting in over 3K conversation trees and around 20K conversations. | 19,945 | 20,128 | 19,945 | 16,384 | Apache-2.0 |
| LMSYS-Chat (Zheng et al., 2023) | A collection of 1M real-world conversations spanning 25 SOTA LLMs, similar to OpenAssistant. We sample 50K conversations for our training mixture. | 50,000 | 50,000 | 50,000 | 37,422 | LMSYS-Chat-1M Dataset License Agreement |
| wikiHow | A collection of how-to articles spanning a diverse range of daily life topics from an online wiki-style platform. | 20,400 | 6,055 | 20,400 | 6,055 | CC-0 |
| Anudesh | A collection of crowd-sourced prompts accompanied by responses generated from the Llama 2 70B model. | 5,234 | 7,577 | 5,234 | 7,577 | CC-BY-4.0 |
| NMT (Gala et al., 2023) | A multi-domain human-annotated dataset containing 50K bitext English-Hindi translation pairs from BPCC-Human (Gala et al., 2023) to enable better cross-lingual transfer. | 50,000 | - | 50,000 | - | CC-BY-4.0 |

Supervised Fine-tuning

We fine-tune the OpenHathi model using the above-compiled datasets. We perform parameter-efficient finetuning with LoRA (Hu et al., 2022). The hyper-parameters used are listed in the table below.

Table 2: Hyperparameters for Fine-tuning
Hyper-Parameter Value
LoRA Rank 16
LoRA alpha 32
LoRA Dropout 0.05
LoRA Target Modules ["q_proj", "v_proj", "k_proj", "down_proj", "gate_proj", "up_proj"]
Epochs 4
Learning rate 5e-4
Batch Size 128
Floating Point Precision bfloat16
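
To give a sense of scale for the configuration in Table 2, the following back-of-the-envelope sketch counts the trainable LoRA parameters. It assumes the standard Llama 2 7B layer shapes (hidden size 4096, MLP intermediate size 11008, 32 decoder layers); these shapes are an assumption not stated in this post, based on OpenHathi being an extension of Llama 2. Each adapted linear layer of shape d_in x d_out gains rank * (d_in + d_out) parameters, and the adapter output is scaled by alpha / rank.

```python
RANK = 16
ALPHA = 32

# Assumed Llama 2 7B shapes (not stated in the post).
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32

# (d_in, d_out) for each targeted module from Table 2.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"per layer: {per_layer:,}")        # per layer: 1,118,208
print(f"total:     {total_trainable:,}")  # total:     35,782,656
print(f"scaling:   {scaling}")            # scaling:   2.0
```

Under these assumptions the adapters amount to roughly 36M trainable parameters, a small fraction of the base model.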

During fine-tuning, the loss was computed only for the output tokens. We used the OpenInstruct framework for fine-tuning, customizing it for our requirements (our custom version is available as IndicInstruct). One fine-tuning example corresponds to one example in the dataset. However, this is suboptimal since many tokens are wasted as padding tokens. We plan to optimize this process by packing multiple dataset examples into a single fine-tuning example (Krell et al., 2023, Iyer et al., 2022).
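A minimal sketch of output-only loss masking is shown below. It uses the common convention of setting labels to -100 (the default `ignore_index` of PyTorch's cross-entropy loss) for prompt and padding positions so that only output tokens contribute to the loss. The function name, the token ids, and `pad_id=0` are illustrative assumptions, not details of the IndicInstruct implementation.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_labels(prompt_ids, output_ids, max_len, pad_id=0):
    """Return (input_ids, labels, attention_mask), each padded to max_len."""
    input_ids = (list(prompt_ids) + list(output_ids))[:max_len]
    # Mask the prompt so the loss covers only the output tokens.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(output_ids))[:max_len]
    num_pad = max_len - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * num_pad
    input_ids = input_ids + [pad_id] * num_pad
    labels = labels + [IGNORE_INDEX] * num_pad  # padding is ignored as well
    return input_ids, labels, attention_mask


# Made-up token ids: a 3-token prompt and a 2-token output in a length-8 example.
ids, labels, mask = build_labels([5, 6, 7], [8, 9], max_len=8)
padding_fraction = mask.count(0) / len(mask)  # share of positions spent on padding
```

In this toy case three of the eight positions are padding, which illustrates the waste that packing several examples into one sequence is meant to reduce.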

Model Selection

We fine-tune the OpenHathi model for 4 epochs and save the model after each epoch. We evaluate each epoch’s checkpoint on the dev set and compare the average performance. We observe that the epoch 3 checkpoint performs well on NLU tasks and epoch 4 checkpoint performs well on NLG tasks. We perform checkpoint averaging, where we interpolate the weights of the above two checkpoints to obtain a model that performs well on both NLU as well as NLG tasks. We found the best interpolation weight to be around 0.6.

interpolated weights = 0.6 * checkpoint_3 + (1 - 0.6) * checkpoint_4
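
The interpolation above can be sketched as a parameter-wise weighted average of two checkpoints. This is an illustrative sketch that assumes both checkpoints are dictionaries with identical parameter names; the values can be anything that supports scalar multiplication and addition (for example tensors or arrays), and plain floats are used here only to keep the example dependency-free.

```python
def interpolate_checkpoints(state_a, state_b, weight_a=0.6):
    """Return weight_a * state_a + (1 - weight_a) * state_b, parameter by parameter."""
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    return {
        name: weight_a * state_a[name] + (1.0 - weight_a) * state_b[name]
        for name in state_a
    }


# Toy stand-ins for the epoch 3 and epoch 4 checkpoints.
checkpoint_3 = {"layer.weight": 1.0, "layer.bias": 0.0}
checkpoint_4 = {"layer.weight": 0.0, "layer.bias": 10.0}
merged = interpolate_checkpoints(checkpoint_3, checkpoint_4, weight_a=0.6)
```

The interpolation weight would typically be chosen by sweeping a few values and scoring each merged model on the dev set.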

Full vs. LoRA finetuning

Full fine-tuning (FFT), where all the parameters of the model are updated, and LoRA fine-tuning, where only a small set of additional parameters is updated, are popular methods for instruction fine-tuning of large language models (LLMs). We fine-tuned two models, one using full fine-tuning and the other using LoRA fine-tuning, on a portion of the instruction fine-tuning dataset, namely FLAN v2 English + Hindi. For our evaluation, we used a subset of Natural Language Understanding (NLU) tasks in Hindi, along with BoolQ and MMLU tasks in English, as development sets to decide between full fine-tuning and LoRA fine-tuning. We observed that the fully fine-tuned model outperformed the OpenHathi base model on the IndicCopa and IndicXParaphrase tasks. However, the fully fine-tuned model performed poorly on English tasks compared to both the base and LoRA models. LoRA fine-tuning either showed improvements or maintained the base model’s performance on both Hindi NLU and English tasks. Consequently, we chose to use LoRA fine-tuning for all our models. All results reported subsequently are for LoRA fine-tuned models.

Full vs. LoRA fine-tuning ablation results on a few test sets.

Evaluation on NLP Benchmarks

We evaluate our model on a diverse set of NLU and NLG tasks. These include native Hindi test sets from IndicXTREME (Doddapaneni et al., 2023) and Indic NLG Suite (Kumar et al., 2022). To test the knowledge and reasoning capabilities of the model, we evaluate on machine-translated versions of benchmarks such as MMLU (Hendrycks et al., 2021), Hellaswag (Zellers et al., 2019), ARC (Clark et al., 2018), Winogrande (Sakaguchi et al., 2019) and BoolQ (Clark et al., 2019). The translations were also done using IndicTrans2. While not perfect, these give an indication of the trends in LLM performance for Hindi. An important area of work is the creation of equivalent benchmarks for Hindi.

Results

The tables below show a comparison of Airavata with the base model (OpenHathi) as well as with a translate-test approach using a strong English model (Llama 2 7B Chat). In the translate-test approach, we translate the Hindi input into English using IndicTrans2 before prompting the English model. We see that Airavata outperforms OpenHathi significantly on most tasks, showing that fine-tuning on the IndicInstruct dataset helps align the base model to a variety of tasks. The performance of translate-test varies a lot, while Airavata achieves more consistent performance. On translation, OpenHathi and Airavata have similar performance. OpenHathi is already trained on parallel corpora, hence the base model is already good at translation, and the Airavata model retains that performance. Performance on generation tasks is a mixed bag, indicating the need for further improvement. Table 4 shows a comparison on English test sets along with the corresponding (machine-translated) Hindi test sets. We see that there is a 5-15 point gap between English and Hindi accuracy across various tasks for both OpenHathi and Airavata. This indicates that English knowledge is not being fully transferred to Hindi, showing the need for better alignment between English and Hindi in the models.
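The translate-test baseline described above amounts to a two-stage pipeline, sketched below. Both `translate_hi_to_en` and `english_llm` are hypothetical callables standing in for the MT system and the English model; the stub implementations exist only so the sketch runs end to end and do not reflect real model behaviour.

```python
def translate_test(hindi_prompt, translate_hi_to_en, english_llm):
    """Translate the Hindi input to English, then query the English model."""
    english_prompt = translate_hi_to_en(hindi_prompt)
    return english_llm(english_prompt)


def accuracy(examples, predict):
    """Fraction of examples whose prediction exactly matches the gold label."""
    correct = sum(1 for ex in examples if predict(ex["prompt"]) == ex["label"])
    return correct / len(examples)


# Stubs for illustration only (not real models).
_fake_translations = {"यह फिल्म शानदार थी": "This movie was great",
                      "खाना बहुत खराब था": "The food was very bad"}


def stub_translator(text):
    return _fake_translations[text]


def stub_english_llm(prompt):
    return "positive" if "great" in prompt else "negative"


dev_examples = [
    {"prompt": "यह फिल्म शानदार थी", "label": "positive"},
    {"prompt": "खाना बहुत खराब था", "label": "negative"},
]
score = accuracy(
    dev_examples,
    lambda prompt: translate_test(prompt, stub_translator, stub_english_llm),
)
```

One consequence of this design is that translation errors propagate directly into the English model's input, which may contribute to the variability noted above.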

Table 3: F1 scores on Indic NLU and Commonsense Reasoning tasks

| Task | OpenHathi (0-Shot) | Llama 2 7B Chat, translate-test (0-Shot) | Airavata (0-Shot) | OpenHathi (5-Shot) | Llama 2 7B Chat, translate-test (5-Shot) | Airavata (5-Shot) |
| --- | --- | --- | --- | --- | --- | --- |
| IndicSentiment | 72.89 | 97.85 | 95.81 | 96.59 | 98.43 | 97.01 |
| IndicCopa | 68.69 | 76.53 | 63.75 | 42.77 | 78.34 | 72.97 |
| IndicXNLI | 16.67 | 23.67 | 73.26 | 42.25 | 47.96 | 74.7 |
| IndicXParaphrase | 71.72 | 9.54 | 76.53 | 66.67 | 48.56 | 69.87 |

Table 4: Accuracy on English NLU and Commonsense Reasoning tasks and its translated variants

| Task | Variant | OpenHathi (0-Shot) | Airavata (0-Shot) | OpenHathi (5-Shot) | Airavata (5-Shot) |
| --- | --- | --- | --- | --- | --- |
| MMLU | English | 36.16 | 41.39 | 40.12 | 43.28 |
| MMLU | Hindi (Translated) | 32.27 | 34.96 | 35.13 | 36 |
| BoolQ | English | 52.63 | 73 | 64.46 | 62.02 |
| BoolQ | Hindi (Translated) | 58.56 | 64.5 | 65.69 | 51.47 |
| ARC Easy | English | 57.28 | 70.5 | 62.12 | 71.04 |
| ARC Easy | Hindi (Translated) | 44.28 | 54 | 49.87 | 54.84 |
| ARC Challenge | English | 39.85 | 45.90 | 46.25 | 48.29 |
| ARC Challenge | Hindi (Translated) | 32.68 | 35.92 | 36.6 | 36.26 |
| HellaSwag | English | 31.48 | 34.37 | 32.45 | 36.83 |
| HellaSwag | Hindi (Translated) | 25.59 | 25.37 | 24.85 | 24.84 |
| Winogrande | English | 49.17 | 49.72 | - | - |

Table 5: chrF++ and BLEURT scores on English-Hindi translation task

| Test set | Metric | OpenHathi (0-Shot) | Airavata (0-Shot) | OpenHathi (5-Shot) | Airavata (5-Shot) |
| --- | --- | --- | --- | --- | --- |
| Flores | chrF++ | 55.41 | 54.82 | 54.98 | 54.24 |
| Flores | BLEURT | 0.7103 | 0.6970 | 0.7115 | 0.7084 |
| IN22-Gen | chrF++ | 54.23 | 53.78 | 54.53 | 52.81 |
| IN22-Gen | BLEURT | 0.7156 | 0.7012 | 0.7181 | 0.7037 |

Table 6: F1, Rouge L and BLEURT scores on Indic NLG tasks

| Task | Metric | OpenHathi (1-Shot) | Llama 2 7B Chat, translate-test (1-Shot) | Airavata (1-Shot) |
| --- | --- | --- | --- | --- |
| Indic QA (No Context) | F1 | 17 | 4.58 | 21.01 |
| Indic QA (With Context) | F1 | 20.69 | 19.59 | 37.69 |
| Indic Headline | Rouge L | 11.26 | 23.06 | 12.32 |
| Indic Headline | BLEURT | 0.4682 | 0.4590 | 0.3793 |
| IndicWikiBio | Rouge L | 20.45 | 41.01 | 10.66 |
| IndicWikiBio | BLEURT | 0.5185 | 0.6415 | 0.4279 |

Examples

Examples where Airavata generates good output

Airavata performs well on some content generation tasks like drafting emails (#A1), as well as on general how-to questions (#A3, #A6). It can follow instructions regarding output structure (#A4 and #A5).

Example #A1

Prompt

जिस ग्राहक को अपने ऑर्डर में देरी का अनुभव हुआ हो, उसके लिए एक माफी ईमेल ड्राफ्ट करें और आश्वस्त करें कि समस्या का समाधान हो गया है।

Model Response

प्रिय [ग्राहक का नाम], ।
मुझे आशा है कि यह ईमेल आपको अच्छी तरह से मिल जाएगा। मैं आपको यह सूचित करने के लिए लिख रहा हूं कि हमारे पिछले ऑर्डर में देरी के कारण, [ऑर्डर संख्या] के लिए आपकी डिलीवरी में थोड़ी देरी हुई है। मैं समझता हूं कि यह असुविधाजनक हो सकता है और मैं माफी मांगता हूं। ।
मैं यह सुनिश्चित करने के लिए हर संभव प्रयास कर रहा हूं कि इस मुद्दे को जल्द से जल्द हल किया जाए और भविष्य में ऐसी किसी भी घटना से बचने के लिए हम जो भी कदम उठा सकते हैं, वह उठा रहे हैं। ।
इस बीच, मैं आपको सूचित करना चाहता हूं कि [ऑर्डर संख्या] के लिए आपकी डिलीवरी अब [तिथि] पर उपलब्ध है। कृपया मुझे बताएं कि क्या आपके पास कोई प्रश्न या चिंताएँ हैं। ।
आपके धैर्य और समझ के लिए धन्यवाद। मैं भविष्य में आपके साथ काम करने के लिए तत्पर हूं। ।
ईमानदारी से, [आपका नाम]

Example #A2

Prompt

भारत में बड़ों के पैर छूना और उनसे बहस न करना नैतिक आचरण माना जाता था। इसके पीछे क्या तर्क है मुझे समझाओ

Model Response

"हम्म, यह एक दिलचस्प सवाल है! भारत में बड़ों को सम्मानित व्यक्ति माना जाता है और आमतौर पर बुजुर्गों की ओर रुख किया जाता है जब कोई कठिन निर्णय लेना होता है या जानकारी लेनी होती है। ऐसा इसलिए है क्योंकि बड़ों का ज्ञान और अनुभव पीढ़ियों से पारित किए गए हैं और इन्हें बहुमूल्य संपत्ति माना जाता है। इसलिए, बड़ों के साथ बातचीत करते समय सम्मानपूर्वक व्यवहार करना और उनके पैरों को छूने जैसे विनम्र इशारों का उपयोग करना महत्वपूर्ण माना जाता है। ।

यह याद रखना भी महत्वपूर्ण है कि बड़ों को अक्सर लोगों के लिए मार्गदर्शक और सलाहकार के रूप में देखा जाता है, और उन्हें आहत करने या अपमानित करने के बजाय उनकी सलाह लेने की कोशिश करनी चाहिए। यदि किसी बुजुर्ग ने आपको कुछ अनुचित कहा या उसके व्यवहार में कोई गलती की, तो उचित प्रतिक्रिया देना और स्थिति को ठीक से संभालना महत्वपूर्ण है।"

Example #A3

Prompt

मैने बहोत जगह नौकरी के लिए एप्लाई किया पर कही से रिप्लाई नही आया जिस्से मुझे डिप्रेशन हो रहा हे मुझे इसका कोई उपाय दो?

Model Response

यह सुनकर खेद है कि आपको अवसाद का सामना करना पड़ रहा है। यह याद रखना महत्वपूर्ण है कि अकेलेपन, चिंता और निराशा जैसी भावनाओं को दूर करने में मदद करने के कई तरीके हैं। यहाँ कुछ सुझाव दिए गए हैं जो सहायक लग सकते हैंः
1. किसी चिकित्सक या सलाहकार से बात करें-किसी मानसिक स्वास्थ्य पेशेवर से बात करना आपके अवसाद को प्रबंधित करने और इसे कम करने की दिशा में पहला कदम हो सकता है। वे आपकी भावनाओं को समझने और उन पर काबू पाने के लिए एक योजना बनाने में आपकी मदद कर सकते हैं।
2. नियमित व्यायामः शारीरिक गतिविधि एंडोर्फिन छोड़ती है, जो प्राकृतिक मनोदशा बढ़ाने वाले होते हैं। यहां तक कि छोटे पैमाने के व्यायाम जैसे चलना, तैरना या कुछ हल्का खिंचाव भी अवसादग्रस्तता की भावना को कम करने में मदद कर सकता है।
3. प्रकृति में समय बिताएँः प्रकृति में समय बिताने से तनाव को कम करने और कल्याण की भावना को बढ़ावा देने में मदद मिल सकती है। आप टहलने जा सकते हैं, पैदल यात्रा कर सकते हैं या बागवानी का काम करके प्रकृति में अपनी रुचि विकसित कर सकते हैं।
4. प्रियजनों के साथ जुड़ेंः दोस्तों, परिवार या सहकर्मियों के साथ जुड़ें जो समर्थन और प्रोत्साहन प्रदान करते हैं। सामाजिक बातचीत को प्रोत्साहित करने से अलगाव और अकेलेपन की भावनाओं को कम करने में मदद मिल सकती है।
5. रचनात्मक गतिविधियों में संलग्न रहेंः कला, संगीत या लेखन जैसी रचनात्मक गतिविधियों में शामिल होने से मन को विचलित करने और खुशी की भावना लाने में मदद मिल सकती है।
6. माइंडफुलनेस तकनीकों को आजमाएँः ध्यान, गहरी सांस लेने के अभ्यास या योग सहित विभिन्न माइंडफुलनेस तकनीकें आराम करने और वर्तमान क्षण पर ध्यान केंद्रित करने में मदद कर सकती हैं।
7. आत्म-देखभाल का अभ्यास करेंः पर्याप्त नींद लेना, संतुलित आहार खाना और स्वस्थ जीवन शैली बनाए रखने जैसी स्व-देखभाल प्रथाओं में शामिल होना आवश्यक है। इसमें स्नान करना, किताब पढ़ना या टीवी देखते हुए नाश्ता करना जैसी चीजें शामिल हो सकती हैं।
8. सहायता समूहों में भाग लेंः स्थानीय सहायता समूह या सामुदायिक संगठन खोजने का प्रयास करें जो आपके समान चुनौतियों से जूझ रहे हों। दूसरों से जुड़कर, आप अपने अवसाद को कम करने में मदद करने के तरीकों की पहचान कर सकते हैं।
9. पेशेवर मदद लेंः यदि आपके अवसाद का प्रबंधन करना बहुत कठिन लगता है, तो चिकित्सा या परामर्श प्राप्त करने पर विचार करें। एक प्रशिक्षित मानसिक स्वास्थ्य पेशेवर आपको मुकाबला तंत्र सीखने में मदद कर सकता है और आपको अधिक सकारात्मक मानसिकता विकसित करने में मदद कर सकता है। याद रखें, अवसाद को दूर करना आसान नहीं है, लेकिन सही समर्थन और रणनीतियों के साथ, आप प्रगति कर सकते हैं और बेहतर महसूस कर सकते हैं।

Example #A4

Prompt

मैं अपने समय प्रबंधन कौशल को कैसे सुधार सकता हूँ? मुझे पांच बिंदु बताएं।

Model Response

1. अपने लक्ष्यों को लिखें।
2. अपने लक्ष्यों को छोटे, प्रबंधनीय कार्यों में विभाजित करें।
3. एक अनुसूची बनाएँ और उस पर टिके रहें।
4. अपने कार्यों को प्राथमिकता दें।
5. अपने कार्यों को पूरा करने के लिए खुद को पुरस्कृत करें।

Example #A5

Prompt

मैं अपने समय प्रबंधन कौशल को कैसे सुधार सकता हूँ? मुझे पांच बिंदु बताएं और उनका वर्णन करें।

Model Response

1. अपने कार्यों को प्राथमिकता देंः उन कार्यों को प्राथमिकता दें जिन्हें पूरा करने की सबसे अधिक आवश्यकता है और उन्हें पहले पूरा करें।
2. एक अनुसूची बनाएँः एक अनुसूची बनाएँ और उस पर टिके रहें। यह आपको अपने समय का बेहतर प्रबंधन करने और अपने लक्ष्यों को प्राप्त करने में मदद करेगा।
3. ध्यान भटकाने से बचेंः अपने फोन को बंद कर दें और अपने काम पर ध्यान केंद्रित करें।
4. ब्रेक लेंः नियमित ब्रेक लेने से आपको रिचार्ज करने और उत्पादक बने रहने में मदद मिलेगी।
5. अपने आप को पुरस्कृत करेंः अपने आप को पुरस्कृत करने से आपको प्रेरित रहने और अपने लक्ष्यों को प्राप्त करने के लिए प्रेरित करने में मदद मिलेगी।

Example #A6

Prompt

मैं उत्तर भारतीय हूं और मूल हिंदी भाषी हूं। हाल ही में, मुझे अपनी नौकरी के लिए बैंगलोर शिफ्ट होना पड़ा। मुझे यहां के मूल निवासियों से बातचीत करने में थोड़ी परेशानी होती है क्योंकि वे कन्नड़ बोलते हैं और हिंदी नहीं जानते हैं। मैं कन्नड़ सीखना चाहूंगी ताकि मुझे इस समस्या का सामना न करना पड़े। मेरा मार्गदर्शन करें कि मैं खुद को कन्नड़ भाषा में कैसे शिक्षित कर सकता हूं

Model Response

आपके प्रश्न के लिए धन्यवाद! बेंगलुरु जाने पर आपको स्थानीय लोगों की भाषा सीखने में कुछ चुनौतियों का सामना करना पड़ सकता है, लेकिन चिंता मत करो-मदद के लिए यहाँ एक चरण-दर-चरण मार्गदर्शिका दी गई हैः
कन्नड़ सीखने के लिए ये कदम उठाए गए हैंः
कर्नाटक सरकार द्वारा प्रदान किए जा रहे मुफ्त कन्नड़ पाठ्यक्रम
2. कन्नड़ सीखने वाले मंच जैसे कि "कन्नड़ सीखें" या "अंगड़ी" जो कन्नड़ पाठों और अभ्यास कार्यों तक पहुंच प्रदान करते हैं।
3. कन्नड़ वर्णमाला, उच्चारण और शब्दावली सीखना शुरू करने के लिए कन्नड़ के बुनियादी शब्दों और वाक्यांशों को याद रखने का प्रयास करें। आप अपने दम पर ऑनलाइन संसाधनों का उपयोग करके या अन्य कन्नड़ बोलने वालों के साथ अभ्यास करके भी शुरुआत कर सकते हैं।
4. कन्नड़ व्याकरण और संरचनाओं के बारे में जानने के लिए कन्नड़ किताबें या पत्रिकाएं पढ़ना शुरू करें। आप यूट्यूब चैनलों पर कन्नड़ सबक या वार्तालाप सत्रों को देखने का विकल्प भी चुन सकते हैं जिनमें कन्नड़ भाषा शामिल हो सकती है।
5. कन्नड़ गीतों, फिल्मों और टीवी शो का आनंद लें जिन्हें आप समझ सकते हैं। यह आपको नए शब्दों और अभिव्यक्तियों को सुनने और उनका विश्लेषण करने में मदद करेगा।
6. दैनिक गतिविधियों और बातचीत में कन्नड़ का उपयोग करने का अभ्यास करें। उदाहरण के लिए, यदि आपके पास कॉफी शॉप जाना है, तो पहले वहाँ जाते समय कन्नड़ में कुछ छोटे वाक्यांश या अभिवादन आजमाएँ।
7. कन्नड़ में किसी दोस्त या परिवार के सदस्य से बात करें। उन्हें भाषा सिखाने और उनसे सवाल पूछने से आप नई चीजें सीखने और सुधार करने के अवसर प्राप्त करेंगे।
8. अन्य कन्नड़ भाषकों के साथ जुड़ें जिनका आप अनुसरण कर सकते हैं या उनकी सामग्री देख सकते हैं। चाहे वह कोई फेसबुक समूह हो या एक भाषा विनिमय कार्यक्रम, दूसरों के संपर्क में रहना आपकी प्रगति में सहायक होगा। याद रखें, एक अच्छी नींव के साथ, आप निश्चित रूप से जल्द ही कन्नड़ धाराप्रवाह होने में सक्षम होंगे। गुड लक!

Examples where Airavata output has errors

The model can hallucinate facts, as in the plot of the novel in #B1, or fail to follow instructions (#B2).

Example #B1

Prompt

धरमवीर भारती जी का प्रसिद्ध उपन्यास "गुनाहों का देवता " के बारे में बताओ और कहनी के मुख्य किरदार सुधा और चंदर के बारे में संक्षिप्त वर्णन करें?

Model Response

गुनाहों का देवता एक लोकप्रिय भारतीय क्लासिक है जो 18वीं शताब्दी के अंत में लिखा गया था। यह कहानी सुधा और चंदर नामक दो प्रेमियों की कहानी बताती है, जिन्हें उनके परिवारों ने अलग कर दिया था। सुधा अमीर परिवार से आती है और उसे एक अच्छे पति को खोजने के लिए कई दावेदारों पर विचार करना पड़ता है। चंदर एक गरीब किसान परिवार से आता है और वह अपनी शिक्षा और भविष्य में अपने सपनों को पूरा करने के लिए कड़ी मेहनत करता है। सुधा अंततः चंदर को चुनती है और दोनों शादी कर लेते हैं। हालाँकि, उनकी खुशी अल्पकालिक होती है क्योंकि सुधा का अभिशाप उसके पूर्वजों द्वारा डाला जाता है कि वह कभी भी खुश नहीं होगी। सुधा और चंदर तब तक प्यार करते रहे जब तक कि सुधा मर जाती और चंदर फिर अपना शेष जीवन अकेले रहने के लिए छोड़ देता। उपन्यास करुणा और प्रेम के विषयों की पड़ताल करता है, और अक्सर इसे रोमांटिक साहित्य की उत्कृष्ट कृति माना जाता है।

Example #B2

Prompt

मैं मधुबनी कलाकारी के बारे में ज़्यादा जानकारी के लिए उत्सुक हूँ | कृप्या इस कलाकारी के बारे में मुझे १०० शब्दों में जानकारी दें |

Model Response

मुबलि कला एक प्रकार की अमूर्त कला है जो मानव शरीर और अंगों का प्रतिनिधित्व करती है। यह एक बहुमुखी कला रूप है जिसे पारंपरिक और समकालीन दोनों तरीकों से किया जा सकता है, और इसका उपयोग अक्सर चित्रकला, मूर्तिकला और प्रदर्शन जैसे विभिन्न माध्यमों में किया जाता है।

Note: The model's outputs in examples #A3 and #A6 did not contain newlines between bullet points. They have been added here for readability.

Human Evaluation

We evaluate Airavata on a set of real-world prompts written by real users. We test our model on the five abilities listed in the table below:

| Ability Name | Ability |
| --- | --- |
| Long | Ability to generate long-form text like writing essays, speeches, reports, etc. |
| Fact-Ops | Ability to give factual opinions and explanations like seeking recommendations, seeking advice, opinions, explanations, etc. |
| Content | Ability to make content accessible like summarizations, layman explanations, etc. |
| Lang-Creativity | Ability to be creative in language like finding anagrams, rhyming words, vocabulary enhancement, etc. |
| Culture | Ability to answer questions related to Indian culture. |

For each ability, we define a list of intents and domains which are then given to users along with detailed instructions on what kind of prompts are expected. More details about this benchmark are coming soon.

Along with Airavata, we also evaluate ChatGPT (OpenAI et al., 2020), GPT-4 (Achiam et al., 2023) and BactrianX-llama-7B (Li et al., 2023) on the same abilities. BactrianX-llama-7B is an instruction fine-tuned model for Hindi created by directly fine-tuning the base Llama model on machine-translated instructions from the Alpaca and Dolly datasets, followed by response generation from ChatGPT. Annotators were shown a prompt and the response from one of the models, chosen at random, and asked to give a rating for the metrics listed in the table below.

| Metric | Details | Range |
| --- | --- | --- |
| IFA: Instruction Following Ability | This assesses the model's ability to accurately and effectively follow the instructions provided in the prompt. | 0-2 |
| CNS: Closeness to Native Speaker | This assesses how naturally and fluently the model’s responses align with the way a native Hindi speaker would express the same ideas. | 0-2 |
| CQ: Content Quality | This evaluates the response in terms of its factual accuracy, logical flow of ideas, and overall informational relevance. | 0-2 |

In addition to the above metrics, we also ask the user to give a final score between 1 and 5 on their overall satisfaction with the response.
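
As a rough sketch of how such ratings could be aggregated into per-model averages, consider the snippet below. The record layout, the `overall` field name, and the placeholder model names and scores are all hypothetical and chosen only for illustration; only the score ranges (0-2 for IFA, CNS and CQ, and 1-5 for overall satisfaction) come from the description above.

```python
from collections import defaultdict
from statistics import mean

# Valid score ranges taken from the metric definitions above.
RANGES = {"IFA": (0, 2), "CNS": (0, 2), "CQ": (0, 2), "overall": (1, 5)}


def validate(rating):
    """Raise if any score falls outside its allowed range."""
    for field, (low, high) in RANGES.items():
        if not low <= rating[field] <= high:
            raise ValueError(f"{field}={rating[field]} outside [{low}, {high}]")


def average_by_model(ratings, field):
    """Mean score of `field` for each model."""
    buckets = defaultdict(list)
    for rating in ratings:
        validate(rating)
        buckets[rating["model"]].append(rating[field])
    return {model: mean(scores) for model, scores in buckets.items()}


# Placeholder annotations, not real evaluation data.
ratings = [
    {"model": "model_x", "IFA": 2, "CNS": 2, "CQ": 1, "overall": 4},
    {"model": "model_x", "IFA": 1, "CNS": 2, "CQ": 1, "overall": 3},
    {"model": "model_y", "IFA": 0, "CNS": 1, "CQ": 0, "overall": 2},
]
ifa_means = average_by_model(ratings, "IFA")
overall_means = average_by_model(ratings, "overall")
```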

We sample a set of 50 prompts covering various intents and domains (more details about the benchmark coming soon) and get the responses from all the models. The annotators were not told which models they were evaluating, to avoid any bias. They were asked to evaluate each response based solely on the above metrics and the rubrics provided. We report the results below:

Average satisfaction rating for model responses
Metric comparison for different models

We observe that while Airavata still has to improve on instruction-following ability, its gap with respect to GPT-4 and ChatGPT in producing natural-sounding content is narrower. Airavata is significantly better than BactrianX-llama-7B. Several factors could explain the inferior performance of BactrianX-llama-7B: it has no extended vocabulary and no continued pre-training on Hindi, its instruction tuning data is less diverse, and its Hindi instruction tuning data, generated by ChatGPT, is potentially of low quality. OpenHathi and Airavata address these issues. We next dig into the performance on individual abilities, with results shown below:

Average satisfaction rating for model responses

The results show that amongst all abilities, Airavata is best at giving factual opinions and explanations. This is also evident from the examples shown earlier. We observe that the model performs poorly on language-creativity tasks, which is understandable as our SFT data doesn't have any creative components. Comparing the performance of GPT-4 and ChatGPT (GPT-3.5), it is evident that GPT-4 outperforms its counterpart in tasks that are knowledge-intensive or that require creativity. Surprisingly, however, ChatGPT outperforms GPT-4 or is comparable to it on tasks that focus more on language generation capability, like long-form generation, factual opinions, and content accessibility.

We acknowledge that this evaluation is not robust or thorough, given the small number of prompts in our set and the fact that each prompt-response pair was evaluated by only one annotator. Nevertheless, it provides us with various insights that will guide the next steps of improving the model. A larger, more diverse instruction dataset covering more abilities can help improve performance across abilities. At the same time, it must be acknowledged that most of the knowledge comes from English, which has the largest repository of knowledge. Better alignment of Hindi with English representations is key to answering factual questions and reducing hallucinations.

Toxicity and Misinformation Detection

We evaluate Airavata, OpenHathi and Llama2-7B models on publicly available benchmark datasets, in both 0-shot and 5-shot settings. Our evaluation provides insights into key dimensions of LM safety. Multilingual HateCheck (MHC) is a suite of functional tests for hate speech detection, and we use its Hindi subset (Das et al., 2022) to evaluate and compare the performance of all models. We translate TruthfulQA, Implicit Hate, and a human-evaluated subset of the Toxigen dataset into Hindi. This subset of Toxigen has been denoised to retain instances which have annotation agreement from all annotators (Hosseini et al., 2023). While the Implicit Hate dataset (Hosseini et al., 2023) helps evaluate model performance on detecting subtle and implicit forms of hate speech, the human-evaluated Toxigen data contains instances which are directed towards various demographics. We evaluate model performance on the detection of toxicity in these three datasets and their translated instances using the accuracy metric. Further, to evaluate the model's capability for answering factual questions, we use the TruthfulQA dataset (Lin et al., 2022), which contains factual multiple-choice questions that can elicit answers mimicking common human falsehoods.
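Table 7 reports TruthfulQA as an average of the MC1 and MC2 scores. The sketch below follows the commonly used definitions of these two multiple-choice metrics: MC1 is 1 when the single highest-scoring choice is the correct one, and MC2 is the share of probability mass assigned to the true answers. This is an assumption about the scoring, since the exact implementation is not described here, and the sketch simplifies by applying both metrics to the same set of choices. The log-probabilities are made up for illustration.

```python
import math


def mc1(choice_logprobs, correct_index):
    """1.0 if the highest-scoring choice is the correct one, else 0.0."""
    best = max(range(len(choice_logprobs)), key=choice_logprobs.__getitem__)
    return 1.0 if best == correct_index else 0.0


def mc2(choice_logprobs, is_true):
    """Normalized probability mass placed on the true answers."""
    probs = [math.exp(lp) for lp in choice_logprobs]
    true_mass = sum(p for p, truth in zip(probs, is_true) if truth)
    return true_mass / sum(probs)


# Made-up model log-probabilities for three answer choices.
logprobs = [math.log(0.5), math.log(0.3), math.log(0.2)]
mc1_score = mc1(logprobs, correct_index=0)
mc2_score = mc2(logprobs, is_true=[True, False, True])
averaged = (mc1_score + mc2_score) / 2
```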

Based on the accuracy scores in Table 7, Airavata is able to detect openly expressed hate in Hindi statements from MHC with an accuracy similar to the other two models, with similar performance in both 0- and 5-shot settings. On the more challenging instances which contain implicitly veiled hate speech, Airavata is able to identify hate with significantly better accuracy than the other two models on the translated Hindi instances. On the original Implicit Hate dataset, Llama2-7B seems to perform better when given a few examples. On the translated Toxigen subset, Llama2-7B is able to detect targeted toxic instances against certain demographics with the highest accuracy among all three models. However, given a few examples, we observe a significant performance dip for Llama2-7B, and Airavata outperforms it marginally. We observe similar performance on the original English dataset and note that Airavata is better at detecting targeted hate in Hindi than implicitly veiled hate speech. Its performance at detecting targeted hate is, surprisingly, better than at detecting openly expressed hate speech from MHC. On the TruthfulQA dataset, in both 0- and 5-shot settings, Llama2-7B outperforms OpenHathi and Airavata. On the translated TruthfulQA data, a marginal dip in performance can be observed, which indicates that we need further investigation into the model's capability for generating misinformation.

Overall, while these results may suggest that LLMs are able to identify toxicity and hateful speech, we think that further investigation is needed to evaluate toxicity and the presence of social biases within the content generated by LLMs. In the future, we plan to investigate additional existing benchmarks and novel evaluation measures to test LLMs for content safety and reliability.

Table 7: Accuracy on hate and toxicity identification, and answering factual questions

| Dataset | Variant | 0-Shot: OpenHathi | 0-Shot: Llama 2 7B Chat (translate-test) | 0-Shot: Airavata | 5-Shot: OpenHathi | 5-Shot: Llama 2 7B Chat (translate-test) | 5-Shot: Airavata |
|---|---|---|---|---|---|---|---|
| Multilingual HateCheck | Hindi | 70.15 | 70.24 | 70.24 | 70.15 | 70.24 | 70.25 |
| Implicit Hate | English | 50.65 | 57.92 | 62.33 | 51.41 | 65.02 | 62.44 |
| Implicit Hate | Hindi (Translated) | 52.45 | 53.21 | 61.15 | 49.99 | 52.98 | 58.84 |
| Toxigen (human evaluated) | English | 44.91 | 83.35 | 78.63 | 42.71 | 66.34 | 72.24 |
| Toxigen (human evaluated) | Hindi (Translated) | 47.75 | 83.97 | 78.56 | 42.83 | 73.20 | 74.80 |
| TruthfulQA (averaged MC1 & MC2) | English | 30.72 | 37.25 | 33.60 | 30.72 | 37.25 | 33.64 |
| TruthfulQA (averaged MC1 & MC2) | Hindi (Translated) | 34.31 | 35.66 | 35.32 | 34.31 | 35.66 | 35.32 |
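The TruthfulQA rows report an average of MC1 and MC2. As a hedged illustration, the sketch below follows the usual definitions of these two multiple-choice metrics: MC1 checks whether the highest-scoring choice is the true one, and MC2 is the share of probability mass placed on the true choices. The exact scoring pipeline behind Table 7 is not described here, so the per-question averaging in `averaged_mc` and the input format (per-choice log-probabilities with 0/1 labels) are assumptions.

```python
import math
from typing import List


def mc1(logprobs: List[float], labels: List[int]) -> float:
    """1.0 if the choice with the highest log-probability is labelled true, else 0.0."""
    best = max(range(len(logprobs)), key=logprobs.__getitem__)
    return float(labels[best] == 1)


def mc2(logprobs: List[float], labels: List[int]) -> float:
    """Normalised probability mass assigned to the choices labelled true."""
    probs = [math.exp(lp) for lp in logprobs]
    true_mass = sum(p for p, y in zip(probs, labels) if y == 1)
    return true_mass / sum(probs)


def averaged_mc(mc1_scores: List[float], mc2_scores: List[float]) -> float:
    """Mean of dataset-level MC1 and MC2, as a percentage (assumed aggregation)."""
    mean_mc1 = sum(mc1_scores) / len(mc1_scores)
    mean_mc2 = sum(mc2_scores) / len(mc2_scores)
    return 100.0 * (mean_mc1 + mean_mc2) / 2.0


if __name__ == "__main__":
    # Hypothetical log-probabilities for one question with three answer choices.
    lps = [math.log(0.2), math.log(0.3), math.log(0.5)]
    print(mc1(lps, [0, 0, 1]))  # single-true setting
    print(mc2(lps, [1, 1, 0]))  # multi-true setting
```

Note that in TruthfulQA the MC1 and MC2 variants use different answer sets for the same question, which is why the two functions take their own `labels` argument.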

Resources

You can find all information about the project here. We release the following resources to facilitate research into instruction tuning for Indian language LLMs.

Summary and Future Outlook

We release Airavata, an open-source instruction-tuned model for Hindi that shows encouraging performance on a wide range of tasks compared to other open-source models. We make all the datasets and models available for further research into improving Hindi LLMs. This is a first step towards building high-quality open-source LLMs for Indian languages, an effort that encompasses large pre-training datasets, diverse instruction-tuning datasets, and high-quality models.

Limitations

Airavata, like other large language models (LLMs), faces typical challenges. It may hallucinate, producing fabricated information, and it may struggle with accuracy on complex or specialized topics. There is also a risk of producing objectionable or biased content. Its grasp of cultural subtleties and its effectiveness in mixed-language situations may be limited. In addition, the model's performance is closely linked to the quality and breadth of its training data, which may affect its effectiveness and dependability. This is a model for research purposes and is not recommended for any production use cases.

Team

This is a joint effort with collaborators from multiple institutions, including Nilekani Centre at AI4Bharat, IIT Madras, IIIT D&M Kancheepuram, Flipkart, University of Surrey, NICT, A*STAR, IBM Research and Microsoft.

  • Students (in order of contribution): Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain, Aswanth Kumar, Mohammed Safi Ur Rahman Khan.
  • Advisors: Ratish Puduppully, Mitesh Khapra, Diptesh Kanojia, Raj Dabre, Rudra Murthy, Anoop Kunchukuttan.

Feel free to reach out to the following in case of any queries:

Citation

If you find our work useful, please cite our technical report:

@article{gala2024airavata,
  title   = {Airavata: Introducing Hindi Instruction-tuned LLM},
  author  = {Jay Gala and Thanmay Jayakumar and Jaavid Aktar Husain and Aswanth Kumar M and Mohammed Safi Ur Rahman Khan and Diptesh Kanojia and Ratish Puduppully and Mitesh M. Khapra and Raj Dabre and Rudra Murthy and Anoop Kunchukuttan},
  year    = {2024},
  journal = {arXiv preprint arXiv:2401.15006}
}

References

  1. Gala et al. "IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages." TMLR 2023.
  2. Wei et al. "PolyLM: An Open Source Polyglot Large Language Model." arXiv preprint arXiv:2307.06018.
  3. Sarvam et al. "Announcing OpenHathi Series." Sarvam Blog.
  4. Conover et al. "Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM." Databricks Blog.
  5. Longpre et al. "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning." ICML 2023.
  6. Bai et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2204.05862.
  7. Köpf et al. "OpenAssistant Conversations -- Democratizing Large Language Model Alignment." NeurIPS 2023.
  8. Zheng et al. "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset." arXiv preprint arXiv:2309.11998.
  9. Wang et al. "How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources." NeurIPS 2023.
  10. Touvron et al. "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv preprint arXiv:2307.09288.
  11. Brown et al. "Language Models are Few-Shot Learners." NeurIPS 2020.
  12. Achiam et al. "GPT-4 Technical Report." arXiv preprint arXiv:2303.08774.
  13. Doddapaneni et al. "Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages." ACL 2023.
  14. Hu et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022.
  15. Hendrycks et al. "Measuring Massive Multitask Language Understanding." ICLR 2021.
  16. Kumar et al. "IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages." EMNLP 2022.
  17. Clark et al. "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions." NAACL 2019.
  18. Zellers et al. "HellaSwag: Can a Machine Really Finish Your Sentence?" ACL 2019.
  19. Clark et al. "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge." arXiv preprint arXiv:1803.05457.
  20. Krell et al. "Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance." ICLR 2023.
  21. Iyer et al. "OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization." arXiv preprint arXiv:2212.12017.
  22. Li et al. "Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation." arXiv preprint arXiv:2305.15011.
  23. Sakaguchi et al. "WinoGrande: An Adversarial Winograd Schema Challenge at Scale." arXiv preprint arXiv:1907.10641.
  24. Ahuja et al. "MEGA: Multilingual Evaluation of Generative AI." EMNLP 2023.
  25. Lin et al. "TruthfulQA: Measuring How Models Mimic Human Falsehoods." ACL 2022.
  26. Das et al. "HateCheckHIn: Evaluating Hindi Hate Speech Detection Models." LREC 2022.
  27. Hosseini et al. "An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models." TrustNLP 2023.
  28. ElSherief et al. "Latent Hatred: A Benchmark for Understanding Implicit Hate Speech." EMNLP 2021.