Airavata: Introducing Hindi Instruction-tuned LLM

Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain, Aswanth Kumar, Mohammed Safi Ur Rahman Khan, Diptesh Kanojia, Ratish Puduppully, Mitesh Khapra, Raj Dabre, Rudra Murthy, Anoop Kunchukuttan
25 January 2024

Cover image for Airavata generated by DALL-E 3.

Examples of open-ended text generation by Hindi instruction-tuned Airavata model.

Introduction

The last year has witnessed tremendous interest and activity in the world of large language models (LLMs). LLMs hold the potential to unlock exciting applications in artificial intelligence, thanks to their ability to comprehend complex natural language instructions and excel in a broad spectrum of tasks involving language, knowledge, reasoning, and creative generation. To foster research, innovation, and widespread adoption, an open ecosystem is essential. We have observed significant advancements in this area with the launch of models like Llama 2 and Mistral, as well as their instruction-tuned variants such as Llama 2 Chat, Mistral-Instruct, and Zephyr, among others. Major progress has also been made in developing datasets for pre-training (e.g., RedPajama), instruction tuning (e.g., Alpaca, UltraChat, Dolly, OpenAssistant, LMSYS-Chat), and evaluation benchmarks (e.g., AlpacaEval, MT-Bench). However, most of these advancements have been predominantly centered on the English language.

There is some limited support for Indian languages, which can be attributed to the incidental inclusion of Indian language data that slipped through the data filters during the pre-training of these models. However, data representation, tokenizer efficiency, and task performance for Indian languages lag considerably behind English. Even closed-source models such as ChatGPT (Brown et al., 2020) and GPT-4 (Achiam et al., 2023) perform worse in Indian languages than in English (Ahuja et al., 2023). Therefore, there is an urgent need to develop a similar ecosystem of tools, models, and datasets for Indian languages to foster research and innovation. In pursuit of this objective, we recently collaborated with Sarvam AI to launch OpenHathi (Sarvam et al., 2023), an open-source foundational model for Hindi developed by extending Llama 2 (Touvron et al., 2023).

Today, we announce the next step: an initial release of "Airavata", an instruction-tuned model for Hindi built by fine-tuning OpenHathi (Sarvam et al., 2023) on diverse Hindi instruction-tuning datasets to make it better suited for assistive tasks.

Along with the model, we also share the instruction-tuning datasets to enable further research on Indic LLMs. We rely on human-curated, license-friendly instruction-tuning datasets to build Airavata; we do not use data generated by proprietary models such as GPT-4. We think this is a more sustainable way of building instruction-tuned models at scale for most Indic languages: relying on data distilled from commercial models would increase costs and, due to licensing restrictions, restrict free usage in downstream applications.

We also compile a collection of evaluation benchmarks, along with an evaluation framework, to compare LLMs on diverse tasks when instructed in Hindi. Using this benchmark as well as human judgments, we compare different LLMs to quantify the current state of their Hindi capabilities. We conduct a detailed analysis of Airavata's performance on a variety of NLU and NLG tasks and find that instruction fine-tuning helps align the model to a variety of NLU tasks. There is significant room for improvement on NLG tasks, which requires the creation of larger, more diverse instruction datasets as well as innovations in aligning English model representations with Hindi representations to drive better cross-lingual transfer.

Instruction Tuning Dataset Creation

High-quality instruction-tuning datasets are important for good LLM performance, but few diverse datasets exist for Hindi. Following Wei et al. (2023), we rely on translating high-quality English supervised instruction-tuning datasets into Hindi. We use IndicTrans2 (Gala et al., 2023), the state-of-the-art open-source MT model for Indian languages, for translation. Some previous works (Li et al., 2023; Wei et al., 2023) have instead used ChatGPT (Brown et al., 2020) to translate instructions and/or generate responses in Hindi, in order to better use context during translation (IndicTrans2, like most MT models, operates at the sentence level). However, this is not cost-effective: ChatGPT's translation quality is lower than IndicTrans2's (Gala et al., 2023), and its generation quality in Hindi might not be up to the mark (Ahuja et al., 2023). A future avenue of work is improving translation quality when document context is available.

We sample examples from the different datasets listed in Table 1 to ensure balanced representation across tasks while fitting our instruction-tuning budget. We translate the instructions, inputs, and outputs into Hindi, resulting in a total of 404k examples spanning English and Hindi. The translated Hindi examples were then filtered to retain only high-quality examples: an example was kept only if the chrF++ score between its back-translation and the corresponding English original was 50 or above. The final dataset used for instruction tuning contains 385k examples. Table 1 shows the details of the final training dataset, which can be found on the 🤗 Hugging Face Hub.
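
This round-trip filter is straightforward to reproduce. Below is a minimal sketch, assuming hypothetical `translate_en_to_hi` and `translate_hi_to_en` helpers that wrap an MT system such as IndicTrans2; the helper names and threshold handling are illustrative, not the exact pipeline code.

```python
from sacrebleu.metrics import CHRF

# chrF++ is chrF with word n-grams enabled (word_order=2).
chrf = CHRF(word_order=2)

def keep_example(english_text, translate_en_to_hi, translate_hi_to_en,
                 threshold=50.0):
    """Translate to Hindi, back-translate to English, and keep the pair
    only if the chrF++ score against the original is >= threshold."""
    hindi = translate_en_to_hi(english_text)
    back_translation = translate_hi_to_en(hindi)
    score = chrf.sentence_score(back_translation, [english_text]).score
    return score >= threshold
```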

We also create two native Hindi Instruction datasets:

  • wikiHow: wikiHow is an online wiki-style platform that serves as a valuable resource for a diverse array of how-to articles spanning numerous topics. The articles on the platform are human-moderated, ensuring a high standard of quality. The questions posed by users in these articles closely align with potential use cases for this model, making it a rich resource for training. Additionally, it might also help induce reasoning capabilities and elicit logical step-by-step responses. We curate around 20k English and 6k Hindi articles, for a total of around 26k articles. We currently formulate the data as a completion task, given either the question alone or the question along with a few initial steps.
  • Anudesh: Anudesh is a crowd-sourced collection of prompts accompanied by responses generated from the Llama 2 70B model. Participants are given clear guidelines detailing the nature of the interaction required, including the specific language to be used: Indic languages, English, transliterated Indic, or a code-mixed blend of Indic and English. Contributors craft their prompts in adherence to these directives and the specified language criteria, and the prompts are then paired with the corresponding translated outputs from the Llama 2 70B model. More details about the interactions will be released soon.


Table 1: Instruction Fine-tuning Training Dataset Details

| Dataset | Description | Unfiltered (En) | Unfiltered (Hi) | Filtered (En) | Filtered (Hi) | License |
|---|---|---|---|---|---|---|
| FLAN-v2 (Longpre et al., 2023) | A collection of NLP tasks combining a number of existing NLP datasets with various data augmentations, introduced by Chung et al. (2022). We sample around 67K examples for our training mixture. | 67,463 | 67,463 | 67,463 | 65,228 | Apache-2.0 |
| Anthropic-HHH (Bai et al., 2022) | A collection of human-collected preference data for aligning models to be helpful and harmless. We sample 5K conversations from the "chosen" column for our training mixture. | 5,000 | 5,000 | 5,000 | 4,911 | MIT |
| Dolly (Conover et al., 2023) | A corpus of more than 15K records generated by thousands of Databricks employees to enable LLMs to exhibit the magical interactivity of ChatGPT. | 15,011 | 15,011 | 15,011 | 14,880 | CC-BY-SA-3.0 |
| OpenAssistant (Köpf et al., 2023) | A human-generated, human-annotated assistant-style conversation corpus consisting of 38K messages, resulting in over 3K conversation trees and around 20K conversations. | 19,945 | 20,128 | 19,945 | 16,384 | Apache-2.0 |
| LMSYS-Chat (Zheng et al., 2023) | A collection of 1M real-world conversations spanning 25 SOTA LLMs, similar to OpenAssistant. We sample 50K conversations for our training mixture. | 50,000 | 50,000 | 50,000 | 37,422 | LMSYS-Chat-1M Dataset License Agreement |
| wikiHow | A collection of how-to articles spanning a diverse range of daily-life topics from an online wiki-style platform. | 20,400 | 6,055 | 20,400 | 6,055 | CC-0 |
| Anudesh | A collection of crowd-sourced prompts accompanied by responses generated from the Llama 2 70B model. | 5,234 | 7,577 | 5,234 | 7,577 | CC-BY-4.0 |
| NMT (Gala et al., 2023) | A multi-domain human-annotated dataset containing 50K English-Hindi bitext translation pairs from BPCC-Human (Gala et al., 2023) to enable better cross-lingual transfer. | 50,000 | - | 50,000 | - | CC-BY-4.0 |

Supervised Fine-tuning

We fine-tune the OpenHathi model using the above-compiled datasets. We perform parameter-efficient finetuning with LoRA (Hu et al., 2022). The hyper-parameters used are listed in the table below.

Table 2: Hyperparameters for Fine-tuning

| Hyper-parameter | Value |
|---|---|
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | ["q_proj", "v_proj", "k_proj", "down_proj", "gate_proj", "up_proj"] |
| Epochs | 4 |
| Learning rate | 5e-4 |
| Batch size | 128 |
| Floating-point precision | bfloat16 |
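
For reference, the configuration in Table 2 maps directly onto the Hugging Face PEFT API. The snippet below is a minimal sketch, not our actual training code; the base-model identifier is an assumption for illustration.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base-model id assumed for illustration; substitute the OpenHathi checkpoint you use.
base = AutoModelForCausalLM.from_pretrained(
    "sarvamai/OpenHathi-7B-Hi-v0.1-Base", torch_dtype=torch.bfloat16
)

config = LoraConfig(
    r=16,                      # LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj", "gate_proj", "up_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```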

During fine-tuning, the loss was computed only for the output tokens (a sketch of this masking follows below). We used the OpenInstruct framework for fine-tuning, customizing it for our requirements (our custom version is available as IndicInstruct). One fine-tuning example currently corresponds to one dataset example. This is suboptimal, since many tokens are wasted as padding. We plan to optimize this by packing multiple dataset examples into a single fine-tuning example (Krell et al., 2023; Iyer et al., 2022).
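
To make the loss masking concrete, here is a minimal sketch of how a training example might be built, with prompt positions labeled -100 so that the cross-entropy loss ignores them; the function and its arguments are illustrative, not the actual IndicInstruct code.

```python
import torch

def build_example(tokenizer, prompt, response, max_len=2048):
    # Tokenize prompt and response separately so we know where the output starts.
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False).input_ids

    input_ids = (prompt_ids + response_ids)[:max_len]
    # -100 is the ignore_index of PyTorch's cross-entropy loss, so prompt
    # tokens contribute no loss; only response (output) tokens do.
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]

    return {"input_ids": torch.tensor(input_ids),
            "labels": torch.tensor(labels)}
```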

Model Selection

We fine-tune the OpenHathi model for 4 epochs and save a checkpoint after each epoch. We evaluate each epoch's checkpoint on the dev set and compare average performance. We observe that the epoch-3 checkpoint performs well on NLU tasks, while the epoch-4 checkpoint performs well on NLG tasks. We therefore perform checkpoint averaging, interpolating the weights of these two checkpoints to obtain a model that performs well on both NLU and NLG tasks. We found the best interpolation weight to be around 0.6:

interpolated_weights = 0.6 * checkpoint_3 + (1 - 0.6) * checkpoint_4
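
In code, checkpoint averaging is a per-parameter linear interpolation of the two state dicts. A minimal sketch, assuming the two epoch checkpoints were saved as PyTorch state dicts (the file paths are placeholders):

```python
import torch

alpha = 0.6  # interpolation weight selected on the dev set

# Placeholder paths; point these at the saved epoch-3 and epoch-4 checkpoints.
sd3 = torch.load("checkpoint_3/pytorch_model.bin", map_location="cpu")
sd4 = torch.load("checkpoint_4/pytorch_model.bin", map_location="cpu")

# Both checkpoints share the same keys; interpolate every parameter tensor.
merged = {key: alpha * sd3[key] + (1 - alpha) * sd4[key] for key in sd3}

torch.save(merged, "merged_model/pytorch_model.bin")
```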

Full vs. LoRA finetuning

Full fine-tuning (FFT), where all model parameters are updated, and LoRA fine-tuning, where only a small number of additional low-rank parameters are updated, are both popular approaches to instruction fine-tuning of large language models (LLMs). We fine-tuned two models, one with each approach, on a portion of the instruction fine-tuning dataset (FLAN-v2 English + Hindi). For evaluation, we used a subset of Hindi NLU tasks, along with the English BoolQ and MMLU tasks, as development sets to decide between the two. The fully fine-tuned model outperformed the OpenHathi base model on IndicCopa and IndicXParaphrase, but performed poorly on the English tasks compared to both the base and LoRA models. LoRA fine-tuning either improved on or maintained the base model's performance on both the Hindi NLU and English tasks. Consequently, we chose LoRA fine-tuning for all our models; all results reported subsequently are for LoRA fine-tuned models.

Full vs. LoRA fine-tuning ablation results on a few test sets.

Evaluation on NLP Benchmarks

We evaluate our model on a diverse set of NLU and NLG tasks. These include native Hindi test sets from IndicXTREME (Doddapaneni et al., 2023) and the IndicNLG suite (Kumar et al., 2022). To test the knowledge and reasoning capabilities of the model, we evaluate on machine-translated versions of benchmarks such as MMLU (Hendrycks et al., 2021), HellaSwag (Zellers et al., 2019), ARC (Clark et al., 2018), Winogrande (Sakaguchi et al., 2019), and BoolQ (Clark et al., 2019). The translations were also done using IndicTrans2. While not perfect, these give an indication of the trends in LLM performance for Hindi; an important area of future work is the creation of equivalent native benchmarks for Hindi.

Results

The tables below show a comparison of Airavata with the base model (OpenHathi), as well as with a translate-test approach using a strong English model (Llama 2 7B Chat). In the translate-test approach, we translate the Hindi input into English using IndicTrans2 before prompting the English model (a minimal sketch of this pipeline follows below). We see that Airavata significantly outperforms OpenHathi on most tasks, showing that fine-tuning on the IndicInstruct dataset helps align the base model to a variety of tasks. The performance of translate-test varies a lot, while Airavata is more consistent. On translation, OpenHathi and Airavata perform similarly: OpenHathi is already trained on parallel corpora, so the base model is already good at translation, and Airavata retains that capability. Performance on generation tasks is a mixed bag, indicating the need for further improvement. Table 4 compares English test sets with the corresponding machine-translated Hindi test sets. We see a 5-15 point gap between English and Hindi accuracy across tasks for both OpenHathi and Airavata. This indicates that English knowledge is not being transferred to Hindi, showing the need for better alignment between English and Hindi in the models.
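
A minimal sketch of the translate-test baseline, assuming a hypothetical `translate_hi_to_en` helper that wraps an MT system such as IndicTrans2; the helper and generation parameters are illustrative, not the exact evaluation code.

```python
from transformers import pipeline

# English chat model used for the translate-test baseline.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def translate_hi_to_en(text):
    # Placeholder for a Hindi -> English MT call (the blog uses IndicTrans2).
    raise NotImplementedError

def translate_test(hindi_prompt, max_new_tokens=256):
    # Translate the Hindi input to English, then prompt the English-only model.
    english_prompt = translate_hi_to_en(hindi_prompt)
    return generator(english_prompt, max_new_tokens=max_new_tokens,
                     do_sample=False)[0]["generated_text"]
```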

Table 3: F1 scores on Indic NLU and Commonsense Reasoning tasks

| Task | OpenHathi (0-shot) | Llama 2 7B Chat, translate-test (0-shot) | Airavata (0-shot) | OpenHathi (5-shot) | Llama 2 7B Chat, translate-test (5-shot) | Airavata (5-shot) |
|---|---|---|---|---|---|---|
| IndicSentiment | 72.89 | 97.85 | 95.81 | 96.59 | 98.43 | 97.01 |
| IndicCopa | 68.69 | 76.53 | 63.75 | 42.77 | 78.34 | 72.97 |
| IndicXNLI | 16.67 | 23.67 | 73.26 | 42.25 | 47.96 | 74.70 |
| IndicXParaphrase | 71.72 | 9.54 | 76.53 | 66.67 | 48.56 | 69.87 |
Table 4: Accuracy on English NLU and Commonsense Reasoning tasks and their translated Hindi variants

| Task | Variant | OpenHathi (0-shot) | Airavata (0-shot) | OpenHathi (5-shot) | Airavata (5-shot) |
|---|---|---|---|---|---|
| MMLU | English | 36.16 | 41.39 | 40.12 | 43.28 |
| MMLU | Hindi (translated) | 32.27 | 34.96 | 35.13 | 36.00 |
| BoolQ | English | 52.63 | 73.00 | 64.46 | 62.02 |
| BoolQ | Hindi (translated) | 58.56 | 64.50 | 65.69 | 51.47 |
| ARC Easy | English | 57.28 | 70.50 | 62.12 | 71.04 |
| ARC Easy | Hindi (translated) | 44.28 | 54.00 | 49.87 | 54.84 |
| ARC Challenge | English | 39.85 | 45.90 | 46.25 | 48.29 |
| ARC Challenge | Hindi (translated) | 32.68 | 35.92 | 36.60 | 36.26 |
| HellaSwag | English | 31.48 | 34.37 | 32.45 | 36.83 |
| HellaSwag | Hindi (translated) | 25.59 | 25.37 | 24.85 | 24.84 |
| Winogrande | English | 49.17 | 49.72 | - | - |
Table 5: chrF++ and BLEURT scores on the English-Hindi translation task

| Test set | Metric | OpenHathi (0-shot) | Airavata (0-shot) | OpenHathi (5-shot) | Airavata (5-shot) |
|---|---|---|---|---|---|
| FLORES | chrF++ | 55.41 | 54.82 | 54.98 | 54.24 |
| FLORES | BLEURT | 0.7103 | 0.6970 | 0.7115 | 0.7084 |
| IN22-Gen | chrF++ | 54.23 | 53.78 | 54.53 | 52.81 |
| IN22-Gen | BLEURT | 0.7156 | 0.7012 | 0.7181 | 0.7037 |
Table 6: F1, ROUGE-L, and BLEURT scores on Indic NLG tasks (1-shot)

| Task | Metric | OpenHathi | Llama 2 7B Chat (translate-test) | Airavata |
|---|---|---|---|---|
| IndicQA (no context) | F1 | 17.00 | 4.58 | 21.01 |
| IndicQA (with context) | F1 | 20.69 | 19.59 | 37.69 |
| IndicHeadline | ROUGE-L | 11.26 | 23.06 | 12.32 |
| IndicHeadline | BLEURT | 0.4682 | 0.4590 | 0.3793 |
| IndicWikiBio | ROUGE-L | 20.45 | 41.01 | 10.66 |
| IndicWikiBio | BLEURT | 0.5185 | 0.6415 | 0.4279 |

Examples

Examples where Airavata generates good output

Airavata performs well on some content generation tasks, such as drafting emails (#A1), as well as on general how-to questions (#A3, #A6). It can also follow instructions regarding output structure (#A4 and #A5).

Example #A1

Prompt

рдЬрд┐рд╕ рдЧреНрд░рд╛рд╣рдХ рдХреЛ рдЕрдкрдиреЗ рдСрд░реНрдбрд░ рдореЗрдВ рджреЗрд░реА рдХрд╛ рдЕрдиреБрднрд╡ рд╣реБрдЖ рд╣реЛ, рдЙрд╕рдХреЗ рд▓рд┐рдП рдПрдХ рдорд╛рдлреА рдИрдореЗрд▓ рдбреНрд░рд╛рдлреНрдЯ рдХрд░реЗрдВ рдФрд░ рдЖрд╢реНрд╡рд╕реНрдд рдХрд░реЗрдВ рдХрд┐ рд╕рдорд╕реНрдпрд╛ рдХрд╛ рд╕рдорд╛рдзрд╛рди рд╣реЛ рдЧрдпрд╛ рд╣реИред

Model Response

рдкреНрд░рд┐рдп [рдЧреНрд░рд╛рд╣рдХ рдХрд╛ рдирд╛рдо], ред
рдореБрдЭреЗ рдЖрд╢рд╛ рд╣реИ рдХрд┐ рдпрд╣ рдИрдореЗрд▓ рдЖрдкрдХреЛ рдЕрдЪреНрдЫреА рддрд░рд╣ рд╕реЗ рдорд┐рд▓ рдЬрд╛рдПрдЧрд╛ред рдореИрдВ рдЖрдкрдХреЛ рдпрд╣ рд╕реВрдЪрд┐рдд рдХрд░рдиреЗ рдХреЗ рд▓рд┐рдП рд▓рд┐рдЦ рд░рд╣рд╛ рд╣реВрдВ рдХрд┐ рд╣рдорд╛рд░реЗ рдкрд┐рдЫрд▓реЗ рдСрд░реНрдбрд░ рдореЗрдВ рджреЗрд░реА рдХреЗ рдХрд╛рд░рдг, [рдСрд░реНрдбрд░ рд╕рдВрдЦреНрдпрд╛] рдХреЗ рд▓рд┐рдП рдЖрдкрдХреА рдбрд┐рд▓реАрд╡рд░реА рдореЗрдВ рдереЛрдбрд╝реА рджреЗрд░реА рд╣реБрдИ рд╣реИред рдореИрдВ рд╕рдордЭрддрд╛ рд╣реВрдВ рдХрд┐ рдпрд╣ рдЕрд╕реБрд╡рд┐рдзрд╛рдЬрдирдХ рд╣реЛ рд╕рдХрддрд╛ рд╣реИ рдФрд░ рдореИрдВ рдорд╛рдлреА рдорд╛рдВрдЧрддрд╛ рд╣реВрдВред ред
рдореИрдВ рдпрд╣ рд╕реБрдирд┐рд╢реНрдЪрд┐рдд рдХрд░рдиреЗ рдХреЗ рд▓рд┐рдП рд╣рд░ рд╕рдВрднрд╡ рдкреНрд░рдпрд╛рд╕ рдХрд░ рд░рд╣рд╛ рд╣реВрдВ рдХрд┐ рдЗрд╕ рдореБрджреНрджреЗ рдХреЛ рдЬрд▓реНрдж рд╕реЗ рдЬрд▓реНрдж рд╣рд▓ рдХрд┐рдпрд╛ рдЬрд╛рдП рдФрд░ рднрд╡рд┐рд╖реНрдп рдореЗрдВ рдРрд╕реА рдХрд┐рд╕реА рднреА рдШрдЯрдирд╛ рд╕реЗ рдмрдЪрдиреЗ рдХреЗ рд▓рд┐рдП рд╣рдо рдЬреЛ рднреА рдХрджрдо рдЙрдард╛ рд╕рдХрддреЗ рд╣реИрдВ, рд╡рд╣ рдЙрдард╛ рд░рд╣реЗ рд╣реИрдВред ред
рдЗрд╕ рдмреАрдЪ, рдореИрдВ рдЖрдкрдХреЛ рд╕реВрдЪрд┐рдд рдХрд░рдирд╛ рдЪрд╛рд╣рддрд╛ рд╣реВрдВ рдХрд┐ [рдСрд░реНрдбрд░ рд╕рдВрдЦреНрдпрд╛] рдХреЗ рд▓рд┐рдП рдЖрдкрдХреА рдбрд┐рд▓реАрд╡рд░реА рдЕрдм [рддрд┐рдерд┐] рдкрд░ рдЙрдкрд▓рдмреНрдз рд╣реИред рдХреГрдкрдпрд╛ рдореБрдЭреЗ рдмрддрд╛рдПрдВ рдХрд┐ рдХреНрдпрд╛ рдЖрдкрдХреЗ рдкрд╛рд╕ рдХреЛрдИ рдкреНрд░рд╢реНрди рдпрд╛ рдЪрд┐рдВрддрд╛рдПрдБ рд╣реИрдВред ред
рдЖрдкрдХреЗ рдзреИрд░реНрдп рдФрд░ рд╕рдордЭ рдХреЗ рд▓рд┐рдП рдзрдиреНрдпрд╡рд╛рджред рдореИрдВ рднрд╡рд┐рд╖реНрдп рдореЗрдВ рдЖрдкрдХреЗ рд╕рд╛рде рдХрд╛рдо рдХрд░рдиреЗ рдХреЗ рд▓рд┐рдП рддрддреНрдкрд░ рд╣реВрдВред ред
рдИрдорд╛рдирджрд╛рд░реА рд╕реЗ, [рдЖрдкрдХрд╛ рдирд╛рдо]

Example #A2

Prompt

рднрд╛рд░рдд рдореЗрдВ рдмрдбрд╝реЛрдВ рдХреЗ рдкреИрд░ рдЫреВрдирд╛ рдФрд░ рдЙрдирд╕реЗ рдмрд╣рд╕ рди рдХрд░рдирд╛ рдиреИрддрд┐рдХ рдЖрдЪрд░рдг рдорд╛рдирд╛ рдЬрд╛рддрд╛ рдерд╛ред рдЗрд╕рдХреЗ рдкреАрдЫреЗ рдХреНрдпрд╛ рддрд░реНрдХ рд╣реИ рдореБрдЭреЗ рд╕рдордЭрд╛рдУ

Model Response

"рд╣рдореНрдо, рдпрд╣ рдПрдХ рджрд┐рд▓рдЪрд╕реНрдк рд╕рд╡рд╛рд▓ рд╣реИ! рднрд╛рд░рдд рдореЗрдВ рдмрдбрд╝реЛрдВ рдХреЛ рд╕рдореНрдорд╛рдирд┐рдд рд╡реНрдпрдХреНрддрд┐ рдорд╛рдирд╛ рдЬрд╛рддрд╛ рд╣реИ рдФрд░ рдЖрдорддреМрд░ рдкрд░ рдмреБрдЬреБрд░реНрдЧреЛрдВ рдХреА рдУрд░ рд░реБрдЦ рдХрд┐рдпрд╛ рдЬрд╛рддрд╛ рд╣реИ рдЬрдм рдХреЛрдИ рдХрдард┐рди рдирд┐рд░реНрдгрдп рд▓реЗрдирд╛ рд╣реЛрддрд╛ рд╣реИ рдпрд╛ рдЬрд╛рдирдХрд╛рд░реА рд▓реЗрдиреА рд╣реЛрддреА рд╣реИред рдРрд╕рд╛ рдЗрд╕рд▓рд┐рдП рд╣реИ рдХреНрдпреЛрдВрдХрд┐ рдмрдбрд╝реЛрдВ рдХрд╛ рдЬреНрдЮрд╛рди рдФрд░ рдЕрдиреБрднрд╡ рдкреАрдврд╝рд┐рдпреЛрдВ рд╕реЗ рдкрд╛рд░рд┐рдд рдХрд┐рдП рдЧрдП рд╣реИрдВ рдФрд░ рдЗрдиреНрд╣реЗрдВ рдмрд╣реБрдореВрд▓реНрдп рд╕рдВрдкрддреНрддрд┐ рдорд╛рдирд╛ рдЬрд╛рддрд╛ рд╣реИред рдЗрд╕рд▓рд┐рдП, рдмрдбрд╝реЛрдВ рдХреЗ рд╕рд╛рде рдмрд╛рддрдЪреАрдд рдХрд░рддреЗ рд╕рдордп рд╕рдореНрдорд╛рдирдкреВрд░реНрд╡рдХ рд╡реНрдпрд╡рд╣рд╛рд░ рдХрд░рдирд╛ рдФрд░ рдЙрдирдХреЗ рдкреИрд░реЛрдВ рдХреЛ рдЫреВрдиреЗ рдЬреИрд╕реЗ рд╡рд┐рдирдореНрд░ рдЗрд╢рд╛рд░реЛрдВ рдХрд╛ рдЙрдкрдпреЛрдЧ рдХрд░рдирд╛ рдорд╣рддреНрд╡рдкреВрд░реНрдг рдорд╛рдирд╛ рдЬрд╛рддрд╛ рд╣реИред ред

рдпрд╣ рдпрд╛рдж рд░рдЦрдирд╛ рднреА рдорд╣рддреНрд╡рдкреВрд░реНрдг рд╣реИ рдХрд┐ рдмрдбрд╝реЛрдВ рдХреЛ рдЕрдХреНрд╕рд░ рд▓реЛрдЧреЛрдВ рдХреЗ рд▓рд┐рдП рдорд╛рд░реНрдЧрджрд░реНрд╢рдХ рдФрд░ рд╕рд▓рд╛рд╣рдХрд╛рд░ рдХреЗ рд░реВрдк рдореЗрдВ рджреЗрдЦрд╛ рдЬрд╛рддрд╛ рд╣реИ, рдФрд░ рдЙрдиреНрд╣реЗрдВ рдЖрд╣рдд рдХрд░рдиреЗ рдпрд╛ рдЕрдкрдорд╛рдирд┐рдд рдХрд░рдиреЗ рдХреЗ рдмрдЬрд╛рдп рдЙрдирдХреА рд╕рд▓рд╛рд╣ рд▓реЗрдиреЗ рдХреА рдХреЛрд╢рд┐рд╢ рдХрд░рдиреА рдЪрд╛рд╣рд┐рдПред рдпрджрд┐ рдХрд┐рд╕реА рдмреБрдЬреБрд░реНрдЧ рдиреЗ рдЖрдкрдХреЛ рдХреБрдЫ рдЕрдиреБрдЪрд┐рдд рдХрд╣рд╛ рдпрд╛ рдЙрд╕рдХреЗ рд╡реНрдпрд╡рд╣рд╛рд░ рдореЗрдВ рдХреЛрдИ рдЧрд▓рддреА рдХреА, рддреЛ рдЙрдЪрд┐рдд рдкреНрд░рддрд┐рдХреНрд░рд┐рдпрд╛ рджреЗрдирд╛ рдФрд░ рд╕реНрдерд┐рддрд┐ рдХреЛ рдареАрдХ рд╕реЗ рд╕рдВрднрд╛рд▓рдирд╛ рдорд╣рддреНрд╡рдкреВрд░реНрдг рд╣реИред"

Example #A3

Prompt

рдореИрдиреЗ рдмрд╣реЛрдд рдЬрдЧрд╣ рдиреМрдХрд░реА рдХреЗ рд▓рд┐рдП рдПрдкреНрд▓рд╛рдИ рдХрд┐рдпрд╛ рдкрд░ рдХрд╣реА рд╕реЗ рд░рд┐рдкреНрд▓рд╛рдИ рдирд╣реА рдЖрдпрд╛ рдЬрд┐рд╕реНрд╕реЗ рдореБрдЭреЗ рдбрд┐рдкреНрд░реЗрд╢рди рд╣реЛ рд░рд╣рд╛ рд╣реЗ рдореБрдЭреЗ рдЗрд╕рдХрд╛ рдХреЛрдИ рдЙрдкрд╛рдп рджреЛ?

Model Response

рдпрд╣ рд╕реБрдирдХрд░ рдЦреЗрдж рд╣реИ рдХрд┐ рдЖрдкрдХреЛ рдЕрд╡рд╕рд╛рдж рдХрд╛ рд╕рд╛рдордирд╛ рдХрд░рдирд╛ рдкрдбрд╝ рд░рд╣рд╛ рд╣реИред рдпрд╣ рдпрд╛рдж рд░рдЦрдирд╛ рдорд╣рддреНрд╡рдкреВрд░реНрдг рд╣реИ рдХрд┐ рдЕрдХреЗрд▓реЗрдкрди, рдЪрд┐рдВрддрд╛ рдФрд░ рдирд┐рд░рд╛рд╢рд╛ рдЬреИрд╕реА рднрд╛рд╡рдирд╛рдУрдВ рдХреЛ рджреВрд░ рдХрд░рдиреЗ рдореЗрдВ рдорджрдж рдХрд░рдиреЗ рдХреЗ рдХрдИ рддрд░реАрдХреЗ рд╣реИрдВред рдпрд╣рд╛рдБ рдХреБрдЫ рд╕реБрдЭрд╛рд╡ рджрд┐рдП рдЧрдП рд╣реИрдВ рдЬреЛ рд╕рд╣рд╛рдпрдХ рд▓рдЧ рд╕рдХрддреЗ рд╣реИрдВрдГ
1. рдХрд┐рд╕реА рдЪрд┐рдХрд┐рддреНрд╕рдХ рдпрд╛ рд╕рд▓рд╛рд╣рдХрд╛рд░ рд╕реЗ рдмрд╛рдд рдХрд░реЗрдВ-рдХрд┐рд╕реА рдорд╛рдирд╕рд┐рдХ рд╕реНрд╡рд╛рд╕реНрдереНрдп рдкреЗрд╢реЗрд╡рд░ рд╕реЗ рдмрд╛рдд рдХрд░рдирд╛ рдЖрдкрдХреЗ рдЕрд╡рд╕рд╛рдж рдХреЛ рдкреНрд░рдмрдВрдзрд┐рдд рдХрд░рдиреЗ рдФрд░ рдЗрд╕реЗ рдХрдо рдХрд░рдиреЗ рдХреА рджрд┐рд╢рд╛ рдореЗрдВ рдкрд╣рд▓рд╛ рдХрджрдо рд╣реЛ рд╕рдХрддрд╛ рд╣реИред рд╡реЗ рдЖрдкрдХреА рднрд╛рд╡рдирд╛рдУрдВ рдХреЛ рд╕рдордЭрдиреЗ рдФрд░ рдЙрди рдкрд░ рдХрд╛рдмреВ рдкрд╛рдиреЗ рдХреЗ рд▓рд┐рдП рдПрдХ рдпреЛрдЬрдирд╛ рдмрдирд╛рдиреЗ рдореЗрдВ рдЖрдкрдХреА рдорджрдж рдХрд░ рд╕рдХрддреЗ рд╣реИрдВред
2. рдирд┐рдпрдорд┐рдд рд╡реНрдпрд╛рдпрд╛рдордГ рд╢рд╛рд░реАрд░рд┐рдХ рдЧрддрд┐рд╡рд┐рдзрд┐ рдПрдВрдбреЛрд░реНрдлрд┐рди рдЫреЛрдбрд╝рддреА рд╣реИ, рдЬреЛ рдкреНрд░рд╛рдХреГрддрд┐рдХ рдордиреЛрджрд╢рд╛ рдмрдврд╝рд╛рдиреЗ рд╡рд╛рд▓реЗ рд╣реЛрддреЗ рд╣реИрдВред рдпрд╣рд╛рдВ рддрдХ рдХрд┐ рдЫреЛрдЯреЗ рдкреИрдорд╛рдиреЗ рдХреЗ рд╡реНрдпрд╛рдпрд╛рдо рдЬреИрд╕реЗ рдЪрд▓рдирд╛, рддреИрд░рдирд╛ рдпрд╛ рдХреБрдЫ рд╣рд▓реНрдХрд╛ рдЦрд┐рдВрдЪрд╛рд╡ рднреА рдЕрд╡рд╕рд╛рджрдЧреНрд░рд╕реНрддрддрд╛ рдХреА рднрд╛рд╡рдирд╛ рдХреЛ рдХрдо рдХрд░рдиреЗ рдореЗрдВ рдорджрдж рдХрд░ рд╕рдХрддрд╛ рд╣реИред
3. рдкреНрд░рдХреГрддрд┐ рдореЗрдВ рд╕рдордп рдмрд┐рддрд╛рдПрдБрдГ рдкреНрд░рдХреГрддрд┐ рдореЗрдВ рд╕рдордп рдмрд┐рддрд╛рдиреЗ рд╕реЗ рддрдирд╛рд╡ рдХреЛ рдХрдо рдХрд░рдиреЗ рдФрд░ рдХрд▓реНрдпрд╛рдг рдХреА рднрд╛рд╡рдирд╛ рдХреЛ рдмрдврд╝рд╛рд╡рд╛ рджреЗрдиреЗ рдореЗрдВ рдорджрдж рдорд┐рд▓ рд╕рдХрддреА рд╣реИред рдЖрдк рдЯрд╣рд▓рдиреЗ рдЬрд╛ рд╕рдХрддреЗ рд╣реИрдВ, рдкреИрджрд▓ рдпрд╛рддреНрд░рд╛ рдХрд░ рд╕рдХрддреЗ рд╣реИрдВ рдпрд╛ рдмрд╛рдЧрд╡рд╛рдиреА рдХрд╛ рдХрд╛рдо рдХрд░рдХреЗ рдкреНрд░рдХреГрддрд┐ рдореЗрдВ рдЕрдкрдиреА рд░реБрдЪрд┐ рд╡рд┐рдХрд╕рд┐рдд рдХрд░ рд╕рдХрддреЗ рд╣реИрдВред
4. рдкреНрд░рд┐рдпрдЬрдиреЛрдВ рдХреЗ рд╕рд╛рде рдЬреБрдбрд╝реЗрдВрдГ рджреЛрд╕реНрддреЛрдВ, рдкрд░рд┐рд╡рд╛рд░ рдпрд╛ рд╕рд╣рдХрд░реНрдорд┐рдпреЛрдВ рдХреЗ рд╕рд╛рде рдЬреБрдбрд╝реЗрдВ рдЬреЛ рд╕рдорд░реНрдерди рдФрд░ рдкреНрд░реЛрддреНрд╕рд╛рд╣рди рдкреНрд░рджрд╛рди рдХрд░рддреЗ рд╣реИрдВред рд╕рд╛рдорд╛рдЬрд┐рдХ рдмрд╛рддрдЪреАрдд рдХреЛ рдкреНрд░реЛрддреНрд╕рд╛рд╣рд┐рдд рдХрд░рдиреЗ рд╕реЗ рдЕрд▓рдЧрд╛рд╡ рдФрд░ рдЕрдХреЗрд▓реЗрдкрди рдХреА рднрд╛рд╡рдирд╛рдУрдВ рдХреЛ рдХрдо рдХрд░рдиреЗ рдореЗрдВ рдорджрдж рдорд┐рд▓ рд╕рдХрддреА рд╣реИред
5. рд░рдЪрдирд╛рддреНрдордХ рдЧрддрд┐рд╡рд┐рдзрд┐рдпреЛрдВ рдореЗрдВ рд╕рдВрд▓рдЧреНрди рд░рд╣реЗрдВрдГ рдХрд▓рд╛, рд╕рдВрдЧреАрдд рдпрд╛ рд▓реЗрдЦрди рдЬреИрд╕реА рд░рдЪрдирд╛рддреНрдордХ рдЧрддрд┐рд╡рд┐рдзрд┐рдпреЛрдВ рдореЗрдВ рд╢рд╛рдорд┐рд▓ рд╣реЛрдиреЗ рд╕реЗ рдорди рдХреЛ рд╡рд┐рдЪрд▓рд┐рдд рдХрд░рдиреЗ рдФрд░ рдЦреБрд╢реА рдХреА рднрд╛рд╡рдирд╛ рд▓рд╛рдиреЗ рдореЗрдВ рдорджрдж рдорд┐рд▓ рд╕рдХрддреА рд╣реИред
6. рдорд╛рдЗрдВрдбрдлреБрд▓рдиреЗрд╕ рддрдХрдиреАрдХреЛрдВ рдХреЛ рдЖрдЬрдорд╛рдПрдБрдГ рдзреНрдпрд╛рди, рдЧрд╣рд░реА рд╕рд╛рдВрд╕ рд▓реЗрдиреЗ рдХреЗ рдЕрднреНрдпрд╛рд╕ рдпрд╛ рдпреЛрдЧ рд╕рд╣рд┐рдд рд╡рд┐рднрд┐рдиреНрди рдорд╛рдЗрдВрдбрдлреБрд▓рдиреЗрд╕ рддрдХрдиреАрдХреЗрдВ рдЖрд░рд╛рдо рдХрд░рдиреЗ рдФрд░ рд╡рд░реНрддрдорд╛рди рдХреНрд╖рдг рдкрд░ рдзреНрдпрд╛рди рдХреЗрдВрджреНрд░рд┐рдд рдХрд░рдиреЗ рдореЗрдВ рдорджрдж рдХрд░ рд╕рдХрддреА рд╣реИрдВред
7. рдЖрддреНрдо-рджреЗрдЦрднрд╛рд▓ рдХрд╛ рдЕрднреНрдпрд╛рд╕ рдХрд░реЗрдВрдГ рдкрд░реНрдпрд╛рдкреНрдд рдиреАрдВрдж рд▓реЗрдирд╛, рд╕рдВрддреБрд▓рд┐рдд рдЖрд╣рд╛рд░ рдЦрд╛рдирд╛ рдФрд░ рд╕реНрд╡рд╕реНрде рдЬреАрд╡рди рд╢реИрд▓реА рдмрдирд╛рдП рд░рдЦрдиреЗ рдЬреИрд╕реА рд╕реНрд╡-рджреЗрдЦрднрд╛рд▓ рдкреНрд░рдерд╛рдУрдВ рдореЗрдВ рд╢рд╛рдорд┐рд▓ рд╣реЛрдирд╛ рдЖрд╡рд╢реНрдпрдХ рд╣реИред рдЗрд╕рдореЗрдВ рд╕реНрдирд╛рди рдХрд░рдирд╛, рдХрд┐рддрд╛рдм рдкрдврд╝рдирд╛ рдпрд╛ рдЯреАрд╡реА рджреЗрдЦрддреЗ рд╣реБрдП рдирд╛рд╢реНрддрд╛ рдХрд░рдирд╛ рдЬреИрд╕реА рдЪреАрдЬреЗрдВ рд╢рд╛рдорд┐рд▓ рд╣реЛ рд╕рдХрддреА рд╣реИрдВред
8. рд╕рд╣рд╛рдпрддрд╛ рд╕рдореВрд╣реЛрдВ рдореЗрдВ рднрд╛рдЧ рд▓реЗрдВрдГ рд╕реНрдерд╛рдиреАрдп рд╕рд╣рд╛рдпрддрд╛ рд╕рдореВрд╣ рдпрд╛ рд╕рд╛рдореБрджрд╛рдпрд┐рдХ рд╕рдВрдЧрдарди рдЦреЛрдЬрдиреЗ рдХрд╛ рдкреНрд░рдпрд╛рд╕ рдХрд░реЗрдВ рдЬреЛ рдЖрдкрдХреЗ рд╕рдорд╛рди рдЪреБрдиреМрддрд┐рдпреЛрдВ рд╕реЗ рдЬреВрдЭ рд░рд╣реЗ рд╣реЛрдВред рджреВрд╕рд░реЛрдВ рд╕реЗ рдЬреБрдбрд╝рдХрд░, рдЖрдк рдЕрдкрдиреЗ рдЕрд╡рд╕рд╛рдж рдХреЛ рдХрдо рдХрд░рдиреЗ рдореЗрдВ рдорджрдж рдХрд░рдиреЗ рдХреЗ рддрд░реАрдХреЛрдВ рдХреА рдкрд╣рдЪрд╛рди рдХрд░ рд╕рдХрддреЗ рд╣реИрдВред
9. рдкреЗрд╢реЗрд╡рд░ рдорджрдж рд▓реЗрдВрдГ рдпрджрд┐ рдЖрдкрдХреЗ рдЕрд╡рд╕рд╛рдж рдХрд╛ рдкреНрд░рдмрдВрдзрди рдХрд░рдирд╛ рдмрд╣реБрдд рдХрдард┐рди рд▓рдЧрддрд╛ рд╣реИ, рддреЛ рдЪрд┐рдХрд┐рддреНрд╕рд╛ рдпрд╛ рдкрд░рд╛рдорд░реНрд╢ рдкреНрд░рд╛рдкреНрдд рдХрд░рдиреЗ рдкрд░ рд╡рд┐рдЪрд╛рд░ рдХрд░реЗрдВред рдПрдХ рдкреНрд░рд╢рд┐рдХреНрд╖рд┐рдд рдорд╛рдирд╕рд┐рдХ рд╕реНрд╡рд╛рд╕реНрдереНрдп рдкреЗрд╢реЗрд╡рд░ рдЖрдкрдХреЛ рдореБрдХрд╛рдмрд▓рд╛ рддрдВрддреНрд░ рд╕реАрдЦрдиреЗ рдореЗрдВ рдорджрдж рдХрд░ рд╕рдХрддрд╛ рд╣реИ рдФрд░ рдЖрдкрдХреЛ рдЕрдзрд┐рдХ рд╕рдХрд╛рд░рд╛рддреНрдордХ рдорд╛рдирд╕рд┐рдХрддрд╛ рд╡рд┐рдХрд╕рд┐рдд рдХрд░рдиреЗ рдореЗрдВ рдорджрдж рдХрд░ рд╕рдХрддрд╛ рд╣реИред рдпрд╛рдж рд░рдЦреЗрдВ, рдЕрд╡рд╕рд╛рдж рдХреЛ рджреВрд░ рдХрд░рдирд╛ рдЖрд╕рд╛рди рдирд╣реАрдВ рд╣реИ, рд▓реЗрдХрд┐рди рд╕рд╣реА рд╕рдорд░реНрдерди рдФрд░ рд░рдгрдиреАрддрд┐рдпреЛрдВ рдХреЗ рд╕рд╛рде, рдЖрдк рдкреНрд░рдЧрддрд┐ рдХрд░ рд╕рдХрддреЗ рд╣реИрдВ рдФрд░ рдмреЗрд╣рддрд░ рдорд╣рд╕реВрд╕ рдХрд░ рд╕рдХрддреЗ рд╣реИрдВред

Example #A4

Prompt

рдореИрдВ рдЕрдкрдиреЗ рд╕рдордп рдкреНрд░рдмрдВрдзрди рдХреМрд╢рд▓ рдХреЛ рдХреИрд╕реЗ рд╕реБрдзрд╛рд░ рд╕рдХрддрд╛ рд╣реВрдБ? рдореБрдЭреЗ рдкрд╛рдВрдЪ рдмрд┐рдВрджреБ рдмрддрд╛рдПрдВред

Model Response

1. рдЕрдкрдиреЗ рд▓рдХреНрд╖реНрдпреЛрдВ рдХреЛ рд▓рд┐рдЦреЗрдВред
2. рдЕрдкрдиреЗ рд▓рдХреНрд╖реНрдпреЛрдВ рдХреЛ рдЫреЛрдЯреЗ, рдкреНрд░рдмрдВрдзрдиреАрдп рдХрд╛рд░реНрдпреЛрдВ рдореЗрдВ рд╡рд┐рднрд╛рдЬрд┐рдд рдХрд░реЗрдВред
3. рдПрдХ рдЕрдиреБрд╕реВрдЪреА рдмрдирд╛рдПрдБ рдФрд░ рдЙрд╕ рдкрд░ рдЯрд┐рдХреЗ рд░рд╣реЗрдВред
4. рдЕрдкрдиреЗ рдХрд╛рд░реНрдпреЛрдВ рдХреЛ рдкреНрд░рд╛рдердорд┐рдХрддрд╛ рджреЗрдВред
5. рдЕрдкрдиреЗ рдХрд╛рд░реНрдпреЛрдВ рдХреЛ рдкреВрд░рд╛ рдХрд░рдиреЗ рдХреЗ рд▓рд┐рдП рдЦреБрдж рдХреЛ рдкреБрд░рд╕реНрдХреГрдд рдХрд░реЗрдВред

Example #A5

Prompt

рдореИрдВ рдЕрдкрдиреЗ рд╕рдордп рдкреНрд░рдмрдВрдзрди рдХреМрд╢рд▓ рдХреЛ рдХреИрд╕реЗ рд╕реБрдзрд╛рд░ рд╕рдХрддрд╛ рд╣реВрдБ? рдореБрдЭреЗ рдкрд╛рдВрдЪ рдмрд┐рдВрджреБ рдмрддрд╛рдПрдВ рдФрд░ рдЙрдирдХрд╛ рд╡рд░реНрдгрди рдХрд░реЗрдВред

Model Response

1. рдЕрдкрдиреЗ рдХрд╛рд░реНрдпреЛрдВ рдХреЛ рдкреНрд░рд╛рдердорд┐рдХрддрд╛ рджреЗрдВрдГ рдЙрди рдХрд╛рд░реНрдпреЛрдВ рдХреЛ рдкреНрд░рд╛рдердорд┐рдХрддрд╛ рджреЗрдВ рдЬрд┐рдиреНрд╣реЗрдВ рдкреВрд░рд╛ рдХрд░рдиреЗ рдХреА рд╕рдмрд╕реЗ рдЕрдзрд┐рдХ рдЖрд╡рд╢реНрдпрдХрддрд╛ рд╣реИ рдФрд░ рдЙрдиреНрд╣реЗрдВ рдкрд╣рд▓реЗ рдкреВрд░рд╛ рдХрд░реЗрдВред
2. рдПрдХ рдЕрдиреБрд╕реВрдЪреА рдмрдирд╛рдПрдБрдГ рдПрдХ рдЕрдиреБрд╕реВрдЪреА рдмрдирд╛рдПрдБ рдФрд░ рдЙрд╕ рдкрд░ рдЯрд┐рдХреЗ рд░рд╣реЗрдВред рдпрд╣ рдЖрдкрдХреЛ рдЕрдкрдиреЗ рд╕рдордп рдХрд╛ рдмреЗрд╣рддрд░ рдкреНрд░рдмрдВрдзрди рдХрд░рдиреЗ рдФрд░ рдЕрдкрдиреЗ рд▓рдХреНрд╖реНрдпреЛрдВ рдХреЛ рдкреНрд░рд╛рдкреНрдд рдХрд░рдиреЗ рдореЗрдВ рдорджрдж рдХрд░реЗрдЧрд╛ред
3. рдзреНрдпрд╛рди рднрдЯрдХрд╛рдиреЗ рд╕реЗ рдмрдЪреЗрдВрдГ рдЕрдкрдиреЗ рдлреЛрди рдХреЛ рдмрдВрдж рдХрд░ рджреЗрдВ рдФрд░ рдЕрдкрдиреЗ рдХрд╛рдо рдкрд░ рдзреНрдпрд╛рди рдХреЗрдВрджреНрд░рд┐рдд рдХрд░реЗрдВред
4. рдмреНрд░реЗрдХ рд▓реЗрдВрдГ рдирд┐рдпрдорд┐рдд рдмреНрд░реЗрдХ рд▓реЗрдиреЗ рд╕реЗ рдЖрдкрдХреЛ рд░рд┐рдЪрд╛рд░реНрдЬ рдХрд░рдиреЗ рдФрд░ рдЙрддреНрдкрд╛рджрдХ рдмрдиреЗ рд░рд╣рдиреЗ рдореЗрдВ рдорджрдж рдорд┐рд▓реЗрдЧреАред
5. рдЕрдкрдиреЗ рдЖрдк рдХреЛ рдкреБрд░рд╕реНрдХреГрдд рдХрд░реЗрдВрдГ рдЕрдкрдиреЗ рдЖрдк рдХреЛ рдкреБрд░рд╕реНрдХреГрдд рдХрд░рдиреЗ рд╕реЗ рдЖрдкрдХреЛ рдкреНрд░реЗрд░рд┐рдд рд░рд╣рдиреЗ рдФрд░ рдЕрдкрдиреЗ рд▓рдХреНрд╖реНрдпреЛрдВ рдХреЛ рдкреНрд░рд╛рдкреНрдд рдХрд░рдиреЗ рдХреЗ рд▓рд┐рдП рдкреНрд░реЗрд░рд┐рдд рдХрд░рдиреЗ рдореЗрдВ рдорджрдж рдорд┐рд▓реЗрдЧреАред

Example #A6

Prompt

рдореИрдВ рдЙрддреНрддрд░ рднрд╛рд░рддреАрдп рд╣реВрдВ рдФрд░ рдореВрд▓ рд╣рд┐рдВрджреА рднрд╛рд╖реА рд╣реВрдВред рд╣рд╛рд▓ рд╣реА рдореЗрдВ, рдореБрдЭреЗ рдЕрдкрдиреА рдиреМрдХрд░реА рдХреЗ рд▓рд┐рдП рдмреИрдВрдЧрд▓реЛрд░ рд╢рд┐рдлреНрдЯ рд╣реЛрдирд╛ рдкрдбрд╝рд╛ред рдореБрдЭреЗ рдпрд╣рд╛рдВ рдХреЗ рдореВрд▓ рдирд┐рд╡рд╛рд╕рд┐рдпреЛрдВ рд╕реЗ рдмрд╛рддрдЪреАрдд рдХрд░рдиреЗ рдореЗрдВ рдереЛрдбрд╝реА рдкрд░реЗрд╢рд╛рдиреА рд╣реЛрддреА рд╣реИ рдХреНрдпреЛрдВрдХрд┐ рд╡реЗ рдХрдиреНрдирдбрд╝ рдмреЛрд▓рддреЗ рд╣реИрдВ рдФрд░ рд╣рд┐рдВрджреА рдирд╣реАрдВ рдЬрд╛рдирддреЗ рд╣реИрдВред рдореИрдВ рдХрдиреНрдирдбрд╝ рд╕реАрдЦрдирд╛ рдЪрд╛рд╣реВрдВрдЧреА рддрд╛рдХрд┐ рдореБрдЭреЗ рдЗрд╕ рд╕рдорд╕реНрдпрд╛ рдХрд╛ рд╕рд╛рдордирд╛ рди рдХрд░рдирд╛ рдкрдбрд╝реЗред рдореЗрд░рд╛ рдорд╛рд░реНрдЧрджрд░реНрд╢рди рдХрд░реЗрдВ рдХрд┐ рдореИрдВ рдЦреБрдж рдХреЛ рдХрдиреНрдирдбрд╝ рднрд╛рд╖рд╛ рдореЗрдВ рдХреИрд╕реЗ рд╢рд┐рдХреНрд╖рд┐рдд рдХрд░ рд╕рдХрддрд╛ рд╣реВрдВ

Model Response

рдЖрдкрдХреЗ рдкреНрд░рд╢реНрди рдХреЗ рд▓рд┐рдП рдзрдиреНрдпрд╡рд╛рдж! рдмреЗрдВрдЧрд▓реБрд░реБ рдЬрд╛рдиреЗ рдкрд░ рдЖрдкрдХреЛ рд╕реНрдерд╛рдиреАрдп рд▓реЛрдЧреЛрдВ рдХреА рднрд╛рд╖рд╛ рд╕реАрдЦрдиреЗ рдореЗрдВ рдХреБрдЫ рдЪреБрдиреМрддрд┐рдпреЛрдВ рдХрд╛ рд╕рд╛рдордирд╛ рдХрд░рдирд╛ рдкрдбрд╝ рд╕рдХрддрд╛ рд╣реИ, рд▓реЗрдХрд┐рди рдЪрд┐рдВрддрд╛ рдордд рдХрд░реЛ-рдорджрдж рдХреЗ рд▓рд┐рдП рдпрд╣рд╛рдБ рдПрдХ рдЪрд░рдг-рджрд░-рдЪрд░рдг рдорд╛рд░реНрдЧрджрд░реНрд╢рд┐рдХрд╛ рджреА рдЧрдИ рд╣реИрдГ
рдХрдиреНрдирдбрд╝ рд╕реАрдЦрдиреЗ рдХреЗ рд▓рд┐рдП рдпреЗ рдХрджрдо рдЙрдард╛рдП рдЧрдП рд╣реИрдВрдГ
рдХрд░реНрдирд╛рдЯрдХ рд╕рд░рдХрд╛рд░ рджреНрд╡рд╛рд░рд╛ рдкреНрд░рджрд╛рди рдХрд┐рдП рдЬрд╛ рд░рд╣реЗ рдореБрдлреНрдд рдХрдиреНрдирдбрд╝ рдкрд╛рдареНрдпрдХреНрд░рдо
2. рдХрдиреНрдирдбрд╝ рд╕реАрдЦрдиреЗ рд╡рд╛рд▓реЗ рдордВрдЪ рдЬреИрд╕реЗ рдХрд┐ "рдХрдиреНрдирдбрд╝ рд╕реАрдЦреЗрдВ" рдпрд╛ "рдЕрдВрдЧрдбрд╝реА" рдЬреЛ рдХрдиреНрдирдбрд╝ рдкрд╛рдареЛрдВ рдФрд░ рдЕрднреНрдпрд╛рд╕ рдХрд╛рд░реНрдпреЛрдВ рддрдХ рдкрд╣реБрдВрдЪ рдкреНрд░рджрд╛рди рдХрд░рддреЗ рд╣реИрдВред
3. рдХрдиреНрдирдбрд╝ рд╡рд░реНрдгрдорд╛рд▓рд╛, рдЙрдЪреНрдЪрд╛рд░рдг рдФрд░ рд╢рдмреНрджрд╛рд╡рд▓реА рд╕реАрдЦрдирд╛ рд╢реБрд░реВ рдХрд░рдиреЗ рдХреЗ рд▓рд┐рдП рдХрдиреНрдирдбрд╝ рдХреЗ рдмреБрдирд┐рдпрд╛рджреА рд╢рдмреНрджреЛрдВ рдФрд░ рд╡рд╛рдХреНрдпрд╛рдВрд╢реЛрдВ рдХреЛ рдпрд╛рдж рд░рдЦрдиреЗ рдХрд╛ рдкреНрд░рдпрд╛рд╕ рдХрд░реЗрдВред рдЖрдк рдЕрдкрдиреЗ рджрдо рдкрд░ рдСрдирд▓рд╛рдЗрди рд╕рдВрд╕рд╛рдзрдиреЛрдВ рдХрд╛ рдЙрдкрдпреЛрдЧ рдХрд░рдХреЗ рдпрд╛ рдЕрдиреНрдп рдХрдиреНрдирдбрд╝ рдмреЛрд▓рдиреЗ рд╡рд╛рд▓реЛрдВ рдХреЗ рд╕рд╛рде рдЕрднреНрдпрд╛рд╕ рдХрд░рдХреЗ рднреА рд╢реБрд░реБрдЖрдд рдХрд░ рд╕рдХрддреЗ рд╣реИрдВред
4. рдХрдиреНрдирдбрд╝ рд╡реНрдпрд╛рдХрд░рдг рдФрд░ рд╕рдВрд░рдЪрдирд╛рдУрдВ рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ рдЬрд╛рдирдиреЗ рдХреЗ рд▓рд┐рдП рдХрдиреНрдирдбрд╝ рдХрд┐рддрд╛рдмреЗрдВ рдпрд╛ рдкрддреНрд░рд┐рдХрд╛рдПрдВ рдкрдврд╝рдирд╛ рд╢реБрд░реВ рдХрд░реЗрдВред рдЖрдк рдпреВрдЯреНрдпреВрдм рдЪреИрдирд▓реЛрдВ рдкрд░ рдХрдиреНрдирдбрд╝ рд╕рдмрдХ рдпрд╛ рд╡рд╛рд░реНрддрд╛рд▓рд╛рдк рд╕рддреНрд░реЛрдВ рдХреЛ рджреЗрдЦрдиреЗ рдХрд╛ рд╡рд┐рдХрд▓реНрдк рднреА рдЪреБрди рд╕рдХрддреЗ рд╣реИрдВ рдЬрд┐рдирдореЗрдВ рдХрдиреНрдирдбрд╝ рднрд╛рд╖рд╛ рд╢рд╛рдорд┐рд▓ рд╣реЛ рд╕рдХрддреА рд╣реИред
5. рдХрдиреНрдирдбрд╝ рдЧреАрддреЛрдВ, рдлрд┐рд▓реНрдореЛрдВ рдФрд░ рдЯреАрд╡реА рд╢реЛ рдХрд╛ рдЖрдирдВрдж рд▓реЗрдВ рдЬрд┐рдиреНрд╣реЗрдВ рдЖрдк рд╕рдордЭ рд╕рдХрддреЗ рд╣реИрдВред рдпрд╣ рдЖрдкрдХреЛ рдирдП рд╢рдмреНрджреЛрдВ рдФрд░ рдЕрднрд┐рд╡реНрдпрдХреНрддрд┐рдпреЛрдВ рдХреЛ рд╕реБрдирдиреЗ рдФрд░ рдЙрдирдХрд╛ рд╡рд┐рд╢реНрд▓реЗрд╖рдг рдХрд░рдиреЗ рдореЗрдВ рдорджрдж рдХрд░реЗрдЧрд╛ред
6. рджреИрдирд┐рдХ рдЧрддрд┐рд╡рд┐рдзрд┐рдпреЛрдВ рдФрд░ рдмрд╛рддрдЪреАрдд рдореЗрдВ рдХрдиреНрдирдбрд╝ рдХрд╛ рдЙрдкрдпреЛрдЧ рдХрд░рдиреЗ рдХрд╛ рдЕрднреНрдпрд╛рд╕ рдХрд░реЗрдВред рдЙрджрд╛рд╣рд░рдг рдХреЗ рд▓рд┐рдП, рдпрджрд┐ рдЖрдкрдХреЗ рдкрд╛рд╕ рдХреЙрдлреА рд╢реЙрдк рдЬрд╛рдирд╛ рд╣реИ, рддреЛ рдкрд╣рд▓реЗ рд╡рд╣рд╛рдБ рдЬрд╛рддреЗ рд╕рдордп рдХрдиреНрдирдбрд╝ рдореЗрдВ рдХреБрдЫ рдЫреЛрдЯреЗ рд╡рд╛рдХреНрдпрд╛рдВрд╢ рдпрд╛ рдЕрднрд┐рд╡рд╛рджрди рдЖрдЬрдорд╛рдПрдБред
7. рдХрдиреНрдирдбрд╝ рдореЗрдВ рдХрд┐рд╕реА рджреЛрд╕реНрдд рдпрд╛ рдкрд░рд┐рд╡рд╛рд░ рдХреЗ рд╕рджрд╕реНрдп рд╕реЗ рдмрд╛рдд рдХрд░реЗрдВред рдЙрдиреНрд╣реЗрдВ рднрд╛рд╖рд╛ рд╕рд┐рдЦрд╛рдиреЗ рдФрд░ рдЙрдирд╕реЗ рд╕рд╡рд╛рд▓ рдкреВрдЫрдиреЗ рд╕реЗ рдЖрдк рдирдИ рдЪреАрдЬреЗрдВ рд╕реАрдЦрдиреЗ рдФрд░ рд╕реБрдзрд╛рд░ рдХрд░рдиреЗ рдХреЗ рдЕрд╡рд╕рд░ рдкреНрд░рд╛рдкреНрдд рдХрд░реЗрдВрдЧреЗред
8. рдЕрдиреНрдп рдХрдиреНрдирдбрд╝ рднрд╛рд╖рдХреЛрдВ рдХреЗ рд╕рд╛рде рдЬреБрдбрд╝реЗрдВ рдЬрд┐рдирдХрд╛ рдЖрдк рдЕрдиреБрд╕рд░рдг рдХрд░ рд╕рдХрддреЗ рд╣реИрдВ рдпрд╛ рдЙрдирдХреА рд╕рд╛рдордЧреНрд░реА рджреЗрдЦ рд╕рдХрддреЗ рд╣реИрдВред рдЪрд╛рд╣реЗ рд╡рд╣ рдХреЛрдИ рдлреЗрд╕рдмреБрдХ рд╕рдореВрд╣ рд╣реЛ рдпрд╛ рдПрдХ рднрд╛рд╖рд╛ рд╡рд┐рдирд┐рдордп рдХрд╛рд░реНрдпрдХреНрд░рдо, рджреВрд╕рд░реЛрдВ рдХреЗ рд╕рдВрдкрд░реНрдХ рдореЗрдВ рд░рд╣рдирд╛ рдЖрдкрдХреА рдкреНрд░рдЧрддрд┐ рдореЗрдВ рд╕рд╣рд╛рдпрдХ рд╣реЛрдЧрд╛ред рдпрд╛рдж рд░рдЦреЗрдВ, рдПрдХ рдЕрдЪреНрдЫреА рдиреАрдВрд╡ рдХреЗ рд╕рд╛рде, рдЖрдк рдирд┐рд╢реНрдЪрд┐рдд рд░реВрдк рд╕реЗ рдЬрд▓реНрдж рд╣реА рдХрдиреНрдирдбрд╝ рдзрд╛рд░рд╛рдкреНрд░рд╡рд╛рд╣ рд╣реЛрдиреЗ рдореЗрдВ рд╕рдХреНрд╖рдо рд╣реЛрдВрдЧреЗред рдЧреБрдб рд▓рдХ!

Examples where Airavata output has errors

The model can hallucinate facts, as in the plot of the novel in #B1, or fail to follow instructions, as in #B2.

Example #B1

Prompt

рдзрд░рдорд╡реАрд░ рднрд╛рд░рддреА рдЬреА рдХрд╛ рдкреНрд░рд╕рд┐рджреНрдз рдЙрдкрдиреНрдпрд╛рд╕ "рдЧреБрдирд╛рд╣реЛрдВ рдХрд╛ рджреЗрд╡рддрд╛ " рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ рдмрддрд╛рдУ рдФрд░ рдХрд╣рдиреА рдХреЗ рдореБрдЦреНрдп рдХрд┐рд░рджрд╛рд░ рд╕реБрдзрд╛ рдФрд░ рдЪрдВрджрд░ рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ рд╕рдВрдХреНрд╖рд┐рдкреНрдд рд╡рд░реНрдгрди рдХрд░реЗрдВ?

Model Response

рдЧреБрдирд╛рд╣реЛрдВ рдХрд╛ рджреЗрд╡рддрд╛ рдПрдХ рд▓реЛрдХрдкреНрд░рд┐рдп рднрд╛рд░рддреАрдп рдХреНрд▓рд╛рд╕рд┐рдХ рд╣реИ рдЬреЛ 18рд╡реАрдВ рд╢рддрд╛рдмреНрджреА рдХреЗ рдЕрдВрдд рдореЗрдВ рд▓рд┐рдЦрд╛ рдЧрдпрд╛ рдерд╛ред рдпрд╣ рдХрд╣рд╛рдиреА рд╕реБрдзрд╛ рдФрд░ рдЪрдВрджрд░ рдирд╛рдордХ рджреЛ рдкреНрд░реЗрдорд┐рдпреЛрдВ рдХреА рдХрд╣рд╛рдиреА рдмрддрд╛рддреА рд╣реИ, рдЬрд┐рдиреНрд╣реЗрдВ рдЙрдирдХреЗ рдкрд░рд┐рд╡рд╛рд░реЛрдВ рдиреЗ рдЕрд▓рдЧ рдХрд░ рджрд┐рдпрд╛ рдерд╛ред рд╕реБрдзрд╛ рдЕрдореАрд░ рдкрд░рд┐рд╡рд╛рд░ рд╕реЗ рдЖрддреА рд╣реИ рдФрд░ рдЙрд╕реЗ рдПрдХ рдЕрдЪреНрдЫреЗ рдкрддрд┐ рдХреЛ рдЦреЛрдЬрдиреЗ рдХреЗ рд▓рд┐рдП рдХрдИ рджрд╛рд╡реЗрджрд╛рд░реЛрдВ рдкрд░ рд╡рд┐рдЪрд╛рд░ рдХрд░рдирд╛ рдкрдбрд╝рддрд╛ рд╣реИред рдЪрдВрджрд░ рдПрдХ рдЧрд░реАрдм рдХрд┐рд╕рд╛рди рдкрд░рд┐рд╡рд╛рд░ рд╕реЗ рдЖрддрд╛ рд╣реИ рдФрд░ рд╡рд╣ рдЕрдкрдиреА рд╢рд┐рдХреНрд╖рд╛ рдФрд░ рднрд╡рд┐рд╖реНрдп рдореЗрдВ рдЕрдкрдиреЗ рд╕рдкрдиреЛрдВ рдХреЛ рдкреВрд░рд╛ рдХрд░рдиреЗ рдХреЗ рд▓рд┐рдП рдХрдбрд╝реА рдореЗрд╣рдирдд рдХрд░рддрд╛ рд╣реИред рд╕реБрдзрд╛ рдЕрдВрддрддрдГ рдЪрдВрджрд░ рдХреЛ рдЪреБрдирддреА рд╣реИ рдФрд░ рджреЛрдиреЛрдВ рд╢рд╛рджреА рдХрд░ рд▓реЗрддреЗ рд╣реИрдВред рд╣рд╛рд▓рд╛рдБрдХрд┐, рдЙрдирдХреА рдЦреБрд╢реА рдЕрд▓реНрдкрдХрд╛рд▓рд┐рдХ рд╣реЛрддреА рд╣реИ рдХреНрдпреЛрдВрдХрд┐ рд╕реБрдзрд╛ рдХрд╛ рдЕрднрд┐рд╢рд╛рдк рдЙрд╕рдХреЗ рдкреВрд░реНрд╡рдЬреЛрдВ рджреНрд╡рд╛рд░рд╛ рдбрд╛рд▓рд╛ рдЬрд╛рддрд╛ рд╣реИ рдХрд┐ рд╡рд╣ рдХрднреА рднреА рдЦреБрд╢ рдирд╣реАрдВ рд╣реЛрдЧреАред рд╕реБрдзрд╛ рдФрд░ рдЪрдВрджрд░ рддрдм рддрдХ рдкреНрдпрд╛рд░ рдХрд░рддреЗ рд░рд╣реЗ рдЬрдм рддрдХ рдХрд┐ рд╕реБрдзрд╛ рдорд░ рдЬрд╛рддреА рдФрд░ рдЪрдВрджрд░ рдлрд┐рд░ рдЕрдкрдирд╛ рд╢реЗрд╖ рдЬреАрд╡рди рдЕрдХреЗрд▓реЗ рд░рд╣рдиреЗ рдХреЗ рд▓рд┐рдП рдЫреЛрдбрд╝ рджреЗрддрд╛ред рдЙрдкрдиреНрдпрд╛рд╕ рдХрд░реБрдгрд╛ рдФрд░ рдкреНрд░реЗрдо рдХреЗ рд╡рд┐рд╖рдпреЛрдВ рдХреА рдкрдбрд╝рддрд╛рд▓ рдХрд░рддрд╛ рд╣реИ, рдФрд░ рдЕрдХреНрд╕рд░ рдЗрд╕реЗ рд░реЛрдорд╛рдВрдЯрд┐рдХ рд╕рд╛рд╣рд┐рддреНрдп рдХреА рдЙрддреНрдХреГрд╖реНрдЯ рдХреГрддрд┐ рдорд╛рдирд╛ рдЬрд╛рддрд╛ рд╣реИред

Example #B2

Prompt

рдореИрдВ рдордзреБрдмрдиреА рдХрд▓рд╛рдХрд╛рд░реА рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ реЫреНрдпрд╛рджрд╛ рдЬрд╛рдирдХрд╛рд░реА рдХреЗ рд▓рд┐рдП рдЙрддреНрд╕реБрдХ рд╣реВрдБ | рдХреГрдкреНрдпрд╛ рдЗрд╕ рдХрд▓рд╛рдХрд╛рд░реА рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ рдореБрдЭреЗ резрежреж рд╢рдмреНрджреЛрдВ рдореЗрдВ рдЬрд╛рдирдХрд╛рд░реА рджреЗрдВ |

Model Response

рдореБрдмрд▓рд┐ рдХрд▓рд╛ рдПрдХ рдкреНрд░рдХрд╛рд░ рдХреА рдЕрдореВрд░реНрдд рдХрд▓рд╛ рд╣реИ рдЬреЛ рдорд╛рдирд╡ рд╢рд░реАрд░ рдФрд░ рдЕрдВрдЧреЛрдВ рдХрд╛ рдкреНрд░рддрд┐рдирд┐рдзрд┐рддреНрд╡ рдХрд░рддреА рд╣реИред рдпрд╣ рдПрдХ рдмрд╣реБрдореБрдЦреА рдХрд▓рд╛ рд░реВрдк рд╣реИ рдЬрд┐рд╕реЗ рдкрд╛рд░рдВрдкрд░рд┐рдХ рдФрд░ рд╕рдордХрд╛рд▓реАрди рджреЛрдиреЛрдВ рддрд░реАрдХреЛрдВ рд╕реЗ рдХрд┐рдпрд╛ рдЬрд╛ рд╕рдХрддрд╛ рд╣реИ, рдФрд░ рдЗрд╕рдХрд╛ рдЙрдкрдпреЛрдЧ рдЕрдХреНрд╕рд░ рдЪрд┐рддреНрд░рдХрд▓рд╛, рдореВрд░реНрддрд┐рдХрд▓рд╛ рдФрд░ рдкреНрд░рджрд░реНрд╢рди рдЬреИрд╕реЗ рд╡рд┐рднрд┐рдиреНрди рдорд╛рдзреНрдпрдореЛрдВ рдореЗрдВ рдХрд┐рдпрд╛ рдЬрд╛рддрд╛ рд╣реИред

Note: The model's outputs in examples #A3 and #A6 did not contain newlines between bullet points. They have been added here for readability.

Human Evaluation

We evaluate Airavata on a set of real-world prompts written by real users. We test our model on 5 different abilities, listed in the table below:

| Ability | Description |
|---|---|
| Long | Generate long-form text, such as essays, speeches, reports, etc. |
| Fact-Ops | Give factual opinions and explanations, e.g., seeking recommendations, advice, opinions, explanations, etc. |
| Content | Make content accessible, e.g., summarization, layman explanations, etc. |
| Lang-Creativity | Be creative in language, e.g., finding anagrams, rhyming words, vocabulary enhancement, etc. |
| Culture | Answer questions related to Indian culture. |

For each ability, we define a list of intents and domains, which are then given to users along with detailed instructions on what kind of prompts are expected. More details about this benchmark are coming soon.

Along with Airavata, we also evaluate ChatGPT (Brown et al., 2020), GPT-4 (Achiam et al., 2023), and BactrianX-llama-7B (Li et al., 2023) on the same abilities. BactrianX-llama-7B is an instruction fine-tuned model for Hindi created by directly fine-tuning the base Llama model on machine-translated instructions from the Alpaca and Dolly datasets, with responses generated by ChatGPT. Annotators were shown a prompt and the response from one of the models, chosen at random, and asked to rate it on the metrics listed in the table below.

| Metric | Details | Range |
|---|---|---|
| IFA: Instruction Following Ability | Assesses the model's ability to accurately and effectively follow the instructions provided in the prompt. | 0-2 |
| CNS: Closeness to Native Speaker | Assesses how naturally and fluently the model's responses align with the way a native Hindi speaker would express the same ideas. | 0-2 |
| CQ: Content Quality | Evaluates the response in terms of its factual accuracy, logical flow of ideas, and overall informational relevance. | 0-2 |

In addition to the above metrics, we also ask annotators to give a final score between 1 and 5 reflecting their overall satisfaction with the response.

We sample a set of 50 prompts covering various intents and domains (more details about the benchmark are coming soon) and collect responses from all four models. To avoid bias, annotators were not told which model produced the response they were evaluating; they were instructed to judge each response solely on the above metrics and the rubrics provided. The results are reported below:

Average satisfaction rating for model responses
Metric comparison for different models

We observe that while Airavata still has room to improve on instruction-following ability, its gap with GPT-4 and ChatGPT is narrower when it comes to producing natural-sounding content. Airavata is significantly better than BactrianX-llama-7B. Bactrian-X's lack of an extended vocabulary and of continued pre-training on Hindi, its less diverse instruction-tuning data, and the potentially low-quality Hindi instruction-tuning data generated by ChatGPT could explain its inferior performance; OpenHathi and Airavata address these issues. We next dig into the performance on individual abilities, with results shown below:

Ability-wise average satisfaction rating for model responses

The results show that, among all abilities, Airavata is best at giving factual opinions and explanations, which is also evident from the examples shown earlier. The model fails on language-creativity tasks, which is understandable since our SFT data has no creative component. Comparing GPT-4 and ChatGPT (GPT-3.5), GPT-4 outperforms its counterpart on tasks that are knowledge-intensive or require creativity. Surprisingly, however, ChatGPT matches or outperforms it on tasks that lean more on language generation capability, such as long-form generation, factual opinions, and content accessibility.

We acknowledge that this evaluation is not robust or thorough, given the small number of prompts in our set and the fact that each prompt-response pair was evaluated by only one annotator. Nevertheless, it provides insights that will guide the next steps in improving the model. A larger, more diverse instruction dataset covering more abilities can help improve them. At the same time, most of the models' knowledge comes from English, which has the largest repository of knowledge; better alignment of Hindi with English representations is key to answering factual questions and reducing hallucinations.

Toxicity and Misinformation Detection

We evaluate the Airavata, OpenHathi, and Llama2-7B models on publicly available benchmark datasets in both 0-shot and 5-shot settings. Our evaluation provides insights into key dimensions of LM safety. Multilingual HateCheck (MHC) is a suite of functional tests for hate speech detection; we use its Hindi subset (Das et al., 2022) to evaluate and compare all models. We translate TruthfulQA, Implicit Hate, and a human-evaluated subset of the Toxigen dataset into Hindi. This Toxigen subset has been denoised to retain only instances on which all annotators agree (Hosseini et al., 2023). While the Implicit Hate dataset (ElSherief et al., 2021) helps evaluate model performance on subtle and implicit forms of hate speech, the human-evaluated Toxigen data contains instances directed at various demographics. We evaluate toxicity detection on these three datasets and their translated instances using the accuracy metric. Further, to evaluate the models' ability to answer factual questions, we use the TruthfulQA dataset (Lin et al., 2022), which contains multiple-choice questions designed to probe common human falsehoods.
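
As background for the TruthfulQA numbers, MC1/MC2 scores are derived from the model's log-likelihoods of each answer option. Below is a minimal, illustrative sketch of MC1-style scoring (MC2 additionally normalizes probability mass over the sets of true and false options); this is a reimplementation for exposition, not the exact evaluation code, and it glosses over tokenizer-boundary subtleties.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def option_score(model, tokenizer, question, option):
    """Sum of log-probabilities of the option tokens, given the question."""
    q_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    logits = model(ids).logits
    # Log-probability assigned to each token given its preceding context.
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_logprobs = logprobs[torch.arange(targets.shape[0]), targets]
    return token_logprobs[q_len - 1:].sum().item()  # option tokens only

def mc1_correct(model, tokenizer, question, options, true_index):
    # MC1 counts the question as correct if the true option scores highest.
    scores = [option_score(model, tokenizer, question, opt) for opt in options]
    return scores.index(max(scores)) == true_index
```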

Given the accuracy scores in the table below, Airavata detects openly expressed hate in Hindi statements from MHC with accuracy similar to the other two models, and with similar performance in both 0- and 5-shot settings. On the more challenging instances containing implicitly veiled hate speech, Airavata identifies hate with significantly better accuracy than the other two models on the translated Hindi instances; on the original Implicit Hate dataset, Llama2-7B performs better when given a few examples. On the translated Toxigen subset, Llama2-7B detects toxic instances targeted at certain demographics with the highest accuracy among the three models; however, given a few examples, its performance dips significantly and Airavata outperforms it marginally. We observe similar behavior on the original English dataset and note that Airavata is better at detecting targeted hate in Hindi than implicitly veiled hate speech; surprisingly, it is also better at detecting targeted hate than openly expressed hate speech from MHC. On the TruthfulQA dataset, in both 0- and 5-shot settings, Llama2-7B outperforms OpenHathi and Airavata. On the translated TruthfulQA data, a marginal dip in performance can be observed, which indicates that the models' propensity for generating misinformation needs further investigation.

Overall, while these results suggest that LLMs can identify toxicity and hateful speech, we think further investigation is needed to evaluate toxicity and the presence of social biases in the content generated by LLMs. In the future, we plan to investigate additional existing benchmarks and novel evaluation measures to test LLMs for content safety and reliability.

Table 7: Accuracy on hate and toxicity identification, and answering factual questions

| Dataset | Variant | OpenHathi (0-shot) | Llama 2 7B Chat, translate-test (0-shot) | Airavata (0-shot) | OpenHathi (5-shot) | Llama 2 7B Chat, translate-test (5-shot) | Airavata (5-shot) |
|---|---|---|---|---|---|---|---|
| Multilingual HateCheck | Hindi | 70.15 | 70.24 | 70.24 | 70.15 | 70.24 | 70.25 |
| Implicit Hate | English | 50.65 | 57.92 | 62.33 | 51.41 | 65.02 | 62.44 |
| Implicit Hate | Hindi (translated) | 52.45 | 53.21 | 61.15 | 49.99 | 52.98 | 58.84 |
| Toxigen (human evaluated) | English | 44.91 | 83.35 | 78.63 | 42.71 | 66.34 | 72.24 |
| Toxigen (human evaluated) | Hindi (translated) | 47.75 | 83.97 | 78.56 | 42.83 | 73.20 | 74.80 |
| TruthfulQA (averaged MC1 & MC2) | English | 30.72 | 37.25 | 33.60 | 30.72 | 37.25 | 33.64 |
| TruthfulQA (averaged MC1 & MC2) | Hindi (translated) | 34.31 | 35.66 | 35.32 | 34.31 | 35.66 | 35.32 |

Resources

You can find all information about the project here. We release the following resources to facilitate research into instruction tuning for Indian language LLMs.

Summary and Future Outlook

We release Airavata, an open-source instruction-tuned model for Hindi that shows encouraging performance on a wide range of tasks compared to other open-source models. We make all the datasets and models available for further research into improving Hindi LLMs. This is a first step towards building high-quality open-source LLMs for Indian languages, encompassing large pre-training datasets, diverse instruction-tuning datasets, and high-quality models.

Limitations

Airavata, like other large language models (LLMs), faces typical challenges. These include the possibility of hallucination, leading to fabricated information, and it may struggle with accuracy on complex or specialized topics. There is also a risk of producing objectionable or biased content. Its grasp of cultural subtleties and its effectiveness in mixed-language situations may be limited. In addition, the model's performance is closely linked to the quality and breadth of its training data, which may affect its effectiveness and dependability. This is a model for research purposes and is not recommended for any production use cases.

Team

This is a joint effort with collaborators from multiple institutions, including Nilekani Centre at AI4Bharat, IIT Madras, IIIT D&M Kancheepuram, Flipkart, University of Surrey, NICT, A*STAR, IBM Research and Microsoft.

  • Students (in order of contribution): Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain, Aswanth Kumar, Mohammed Safi Ur Rahman Khan.
  • Advisors: Ratish Puduppully, Mitesh Khapra, Diptesh Kanojia, Raj Dabre, Rudra Murthy, Anoop Kunchukuttan.

Feel free to reach out to the following in case of any queries:

Citation

If you find our work to be useful then please cite our technical report:

@article{gala2024airavata,
  title   = {Airavata: Introducing Hindi Instruction-tuned LLM},
  author  = {Jay Gala and Thanmay Jayakumar and Jaavid Aktar Husain and Aswanth Kumar M and Mohammed Safi Ur Rahman Khan and Diptesh Kanojia and Ratish Puduppully and Mitesh M. Khapra and Raj Dabre and Rudra Murthy and Anoop Kunchukuttan},
  year    = {2024},
  journal = {arXiv preprint arXiv:2401.15006}
}

References

  1. Gala et al. "IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages." TMLR 2023.
  2. Wei et al. "PolyLM: An Open Source Polyglot Large Language Model." arXiv preprint arXiv:2307.06018.
  3. Sarvam et al. "Announcing OpenHathi Series." Sarvam Blog.
  4. Conover et al. "Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM." Databricks Blog.
  5. Longpre et al. "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning." ICML 2023.
  6. Bai et al. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2204.05862.
  7. Köpf et al. "OpenAssistant Conversations -- Democratizing Large Language Model Alignment." NeurIPS 2023.
  8. Zheng et al. "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset." arXiv preprint arXiv:2309.11998.
  9. Wang et al. "How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources." NeurIPS 2023.
  10. Touvron et al. "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv preprint arXiv:2307.09288.
  11. Brown et al. "Language Models are Few-Shot Learners." NeurIPS 2020.
  12. Achiam et al. "GPT-4 Technical Report." arXiv preprint arXiv:2303.08774.
  13. Doddapaneni et al. "Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages." ACL 2023.
  14. Hu et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022.
  15. Hendrycks et al. "Measuring Massive Multitask Language Understanding." ICLR 2021.
  16. Kumar et al. "IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages." EMNLP 2022.
  17. Clark et al. "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions." NAACL 2019.
  18. Zellers et al. "HellaSwag: Can a Machine Really Finish Your Sentence?." ACL 2019.
  19. Clark et al. "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge." arXiv preprint arXiv:1803.05457.
  20. Krell et al. "Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance." ICLR 2023.
  21. Iyer et al. "OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization." arXiv preprint arXiv:2212.12017.
  22. Li et al. "Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation." arXiv preprint arXiv:2305.15011.
  23. Sakaguchi et al. "WinoGrande: An Adversarial Winograd Schema Challenge at Scale." arXiv preprint arXiv:1907.10641.
  24. Ahuja et al. "MEGA: Multilingual Evaluation of Generative AI." EMNLP 2023.
  25. Lin et al. "TruthfulQA: Measuring How Models Mimic Human Falsehoods." ACL 2022.
  26. Das et al. "HateCheckHIn: Evaluating Hindi Hate Speech Detection Models." LREC 2022.
  27. Hosseini et al. "An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models." TrustNLP 2023.
  28. ElSherief et al. "Latent Hatred: A Benchmark for Understanding Implicit Hate Speech." EMNLP 2021.