:bookmark: The Indic NLP Catalog
A Collaborative Catalog of Resources for Indic Language NLP
The Indic NLP Catalog repository is an attempt to collaboratively build the most comprehensive catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent.
Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:
[Wikipedia Dumps](https://dumps.wikimedia.org/)
Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.
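As a rough illustration of the entry format described above, the sketch below assembles a markdown entry from its parts. The helper `format_entry` and its parameter names are hypothetical and not part of the repository; only the `[name](url)` link shape comes from the example entry.

```python
def format_entry(name: str, url: str, description: str = "") -> str:
    """Build one catalog line: a markdown link, optionally followed by a description.

    Hypothetical helper for illustration only; the catalog itself does not ship it.
    """
    entry = f"[{name}]({url})"
    if description.strip():
        # Catalog entries put the description after a colon.
        entry += f": {description.strip()}"
    return entry


print(format_entry("Wikipedia Dumps", "https://dumps.wikimedia.org/"))
```

The contributor name is left out of the generated line, since the text above only asks that it be mentioned in the pull request or issue.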
:+1: Featured Resources
Indian language NLP has come a long way. We feature a few resources that are illustrative of the trends in recent times along various axes and point to a bright future.
- Universal Language Contribution API (ULCA): ULCA is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part of the Bhashini mission. You can upload and discover models, datasets and benchmarks here. This is one repository we really need, and we hope to see it evolve into a standard, large-scale platform for resource discovery and dissemination.
- We are seeing the rise of large-scale datasets across many tasks like IndicCorp (text corpus/9 billion tokens), Samanantar (parallel corpus/50 million sentence pairs), Naamapadam (named entity/5.7 million sentences), HiNER (named entity/100k sentences), Aksharantar (transliteration/26 million pairs), etc. These are being built using either large-scale mining of web resources or large human annotation efforts or both.
- As we aim higher, the datasets and models are achieving higher language coverage. Earlier datasets were available for only a handful of Indian languages, then for 10-12 languages. We are now reaching the next frontier, creating resources like Aksharantar (transliteration/21 languages), FLORES-200 (translation/27 languages), IndoWordNet (wordnet/18 languages) that span almost all languages listed in the Indian constitution and more.
- Particularly, we are seeing datasets getting created for extremely low-resourced languages or languages not yet covered in any dataset like Bodo, Kangri, Khasi, etc.
- From a handful of institutes who pioneered the development of NLP in India, we now have an increasing number of institutes/interest groups and passionate volunteers like AI4Bharat, BUET CSE NLP, KMI, L3Cube, iNLTK, IIT Patna, etc. who are contributing to building resources for Indian languages.
Browse the entire catalog…
:raising_hand: Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.
- Major Indic Language NLP Repositories
- Libraries and Tools
- Evaluation Benchmarks
- Standards
- Text Corpora
- Monolingual Corpus
- Language Identification
- Lexical Resources
- NER Corpora
- Parallel Translation Corpus
- MT Evaluation
- Parallel Transliteration Corpus
- Text Classification
- Textual Entailment/Natural Language Inference
- Paraphrase
- Sentiment, Sarcasm, Emotion Analysis
- Hate Speech and Offensive Comments
- Question Answering
- Dialog
- Discourse
- Information Extraction
- POS Tagged corpus
- Chunk Corpus
- Dependency Parse Corpus
- Co-reference Corpus
- Summarization
- Data to Text
- Models
- Speech Corpora
- OCR Corpora
- Multimodal Corpora
- Language Specific Catalogs
Major Indic Language NLP Repositories
- Universal Language Contribution API (ULCA)
- Technology Development for Indian Languages (TDIL)
- Center for Indian Language Technology (CFILT)
- Language Technologies Research Center (LTRC)
- AI4Bharat
- Linguistic Data Consortium For Indian Languages (LDCIL)
- University of Hyderabad - Sanskrit NLP
- National Platform for Language Technology
- BUET CSE NLP Group
- KMI Linguistics
- L3Cube
- IIT Patna
Libraries and Tools
- Indic NLP Library: Python Library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, etc
- Devnagri to Roman transliteration using hand-crafted rules and lexicons.
- pyiwn: Python Interface to IndoWordNet
- Indic-OCR : OCR for Indic Scripts
- CLTK: Toolkit for many of the world’s classical languages. Support for Sanskrit. Some parts of the Sanskrit library are forked from the Indic NLP Library.
- iNLTK: iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages.
- Sanskrit Coders Indic Transliteration: Script conversion and romanization for Indian languages.
- Smart Sanskrit Annotator: Annotation tool for Sanskrit (paper)
- BNLP: Bengali language processing toolkit with tokenization, embedding, POS tagging, NER support
- CodeSwitch: Language identification, POS Tagging, NER, sentiment analysis support for code mixed data including Hindi and Nepali language
- IndIE: An Open Information Extraction tool (triple extractor) in Hindi. It is conjectured to work for Tamil, Telugu, and Urdu as well.
- Hindi-BenchIE: A triple evaluation tool for 112 Hindi sentences.
Evaluation Benchmarks
Benchmarks spanning multiple tasks.
- AI4Bharat IndicGLUE: NLU benchmark for 11 languages.
- AI4Bharat IndicNLG Suite: NLG benchmark for 11 languages spanning 5 generation tasks: biography generation, sentence summarization, headline generation, paraphrase generation and question generation.
- GLUECoS: Hindi-English code-mixed benchmark containing the following tasks - Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI).
- AI4Bharat Text Classification: A compilation of classification datasets for 10 languages.
- WAT 2021 Translation Dataset: Standard train and test sets for translation between English and 10 Indian languages.
Standards
- Unicode Standard for Indic Scripts
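Because each major Indic script has its own range in Unicode, a script can often be guessed from character names alone. The following is a minimal sketch using only Python's standard `unicodedata` module; the function name `dominant_script` is hypothetical, and the approach is a heuristic that relies on Unicode character names beginning with the script name (for example `DEVANAGARI LETTER KA`).

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script-name prefix among the letters of `text`.

    Heuristic sketch: only alphabetic characters are counted, so vowel signs,
    viramas, digits and punctuation are ignored.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # First word of the Unicode name is the script, e.g. "TAMIL" or "LATIN".
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


print(dominant_script("हिन्दी"))
print(dominant_script("தமிழ்"))
```

This is not a substitute for the language identification resources listed in the catalog: several languages share a script (Hindi and Marathi both use Devanagari), so the result identifies the script, not the language.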
Text Corpora
Monolingual Corpus
- Wikipedia Dumps
- Common Crawl
- OSCAR Corpus: Released in 2019, large-scale processed CommonCrawl.
- WMT Common Crawl Dumps: Crawls between 2012 and 2016. Noisy text, needs to be filtered.
- CC-100 Corpus: Facebook CommonCrawl extracted data. They provide scripts for processing CommonCrawl. StatMT has built a replica of the CC-100 corpus using these scripts. You can find it HERE. This corpus also has romanized corpora for some Indian languages.
- WMT NEWS Crawl
- LDCIL Monolingual Corpus
- Charles University Hindi Monolingual Corpus
- Charles University Urdu Monolingual Corpus
- IIT Bombay Hindi Monolingual Corpus
- EMILLE Corpus (multiple Indian languages)
- Janmabhumi Malayalam Corpus
- Leipzig Corpus
- Sanskrit Monolingual and Sandhi-split Corpus
- Lot Of Indic Tweets Corpus: Large Twitter datasets for Telugu (7.9 million) and Hindi (17.6 million), along with fastText skipgram and CBOW word vectors for the same.
- CMU Romanized Hinglish Corpus: See THIS PAPER for details.
- JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 45k sentences.
- KMI Magahi Corpus
- KMI Awadhi Corpus
- KMI Linguistics Bodo: Contains the Bodo corpus and the frequency-ordered word and punctuation list.
- SMC Malayalam text corpus
- DNLP-Tel Telugu Corpus: Telugu corpus of 280M tokens and 23M sentences along with skip-gram model trained with word2vec.
- Ema-lon Manipuri Corpus: The first comparable corpus built for the Manipuri (mni)-English (eng) language pair, with the monolingual data comprising 1,034,715 Manipuri sentences and 846,796 English sentences in version 1 and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2.
- SinMin Corpus: Contains texts of different genres and styles of the modern and old Sinhala language.
- Kangri_corpus: Monolingual corpus of Kangri, a low-resource endangered Himachali language, comprising 1,81,552 sentences. Described in this paper.
- Sanskrit-Hindi-MT: The Sanskrit Monolingual Data is available here.
- FacebookDecadeCorpora: Contains two language corpora of colloquial Sinhala content extracted from Facebook using the Crowdtangle platform. The larger corpus contains 28,825,820 to 29,549,672 words of text, mostly in Sinhala, English and Tamil and the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from Corpus-Alpha. Described in this paper.
- Nepali National corpus: The Nepali Monolingual written corpus comprises the core corpus containing 802,000 words and the general corpus containing 1,400,000 words. Described here.
Language Identification
- VarDial 2018 Language Identification Dataset: 5 languages - Hindi, Braj, Awadhi, Bhojpuri, Magahi.
Lexical Resources and Semantic Similarity
- IndoWordNet
- IIIT-Hyderabad Word Similarity Database: 7 Indian languages
- Facebook Hindi Analogy Dataset
- MGAD Hindi Analogy dataset
- AI4Bharat Word Frequency Lists: Tokens and their frequencies from the AI4Bharat corpus, a large monolingual corpus.
- Hindi RG-63: Hindi version of the Rubenstein and Goodenough (RG-65) word similarity dataset
- IITB Cognate Datasets: Dataset of Cognates and False Friend Pairs for 12 Indian Languages. (Paper)
- AI4Bharat Cross-lingual Semantic Textual Similarity: 10 sentences across 11 en-Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines.
- Toxicity-200: Toxicity Lists for 200 languages including 27 Indian languages.
- FacebookDecadeCorpora: Contains a list of algorithmically derived stopwords extracted from Corpus-Sinhala-Redux. Described in this paper.
NER Corpora
- FIRE 2013 AUKBC NER Corpus
- FIRE 2014 AUKBC NER Corpus
- IIT Bombay Marathi NER Corpus
- WikiAnn NER Corpus (Noisy) DOWNLOAD (Old broken LINK)
- IJCNLP 2008 NER Corpus: NER corpora for hi, bn, or, te, ur.
- a-mma NER data
- AI4Bharat Naamapadam: NER dataset for 11 Indic languages.
- AsNER: A named entity annotation dataset for low resource Assamese language containing 99k tokens.
- L3Cube-MahaNER: The first major gold standard named entity recognition dataset in Marathi consisting of 25,000 sentences in Marathi language. Described in this paper.
- CFILT HiNER: A large Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens. Described in this paper.
- MultiCoNER: A multilingual complex Named Entity Recognition dataset composed of 2.3 million instances for 11 languages (including datasets for the Indic languages Hindi and Bangla) representing three domains (wiki sentences, questions, and search queries) plus multilingual and code-mixed subsets. The NER tag-set consists of six classes: PER, LOC, CORP, GRP, PROD and CW. Described in this paper.
Parallel Translation Corpus
- BPCC Parallel Corpus: Largest parallel corpus for English and 22 Indian languages (as of Jan 2024). It comprises 230 million sentence pairs between English-Indian languages. A subset of this corpus is the BPCC-Human Corpus containing 2.2 million English-Indic pairs for 22 Indic languages.
- Samanantar Parallel Corpus: Largest parallel corpus for English and 11 Indian languages (as of 2021). It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages.
- FLORES-101: Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel.
- FLORES-200: Human translated evaluation sets for 200 languages released by Facebook. It includes 24 Indic languages. The testsets are n-way parallel.
- IIT Bombay English-Hindi Parallel Corpus: Largest en-hi parallel corpora in public domain (about 1.5 million segments)
- CVIT-IIITH PIB Multilingual Corpus: Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language).
- CVIT-IIITH Mann ki Baat Corpus: Mined from Indian PM Narendra Modi’s Mann ki Baat speeches.
- PMIndia: Parallel corpus for En-Indian languages mined from Mann ki Baat speeches of the PM of India (paper).
- OPUS corpus
- WAT 2018 Parallel Corpus: There may be significant overlap between WAT and OPUS.
- Charles University Parallel Corpora Collection
- Charles University English-Hindi Parallel Corpus: This is included in the IITB parallel corpus.
- Charles University English-Tamil Parallel Corpus
- Charles University English-Odia Parallel Corpus v1.0
- Charles University English-Odia Parallel Corpus v2.0
- Charles University English-Urdu Religious Parallel Corpus
- Indian Language Corpora Initiative: Available on TDIL portal on request
- IndoWordnet Parallel Corpus: Parallel corpora mined from IndoWordNet gloss and/or examples for Indian-Indian language corpora (6.3 million segments, 18 languages).
- MTurk Indian Parallel Corpus
- TED Parallel Corpus
- JW300 Corpus: Parallel corpus mined from jw.org. Religious text from Jehovah’s Witnesses.
- ALT Parallel Corpus: 10k sentences for Bengali, Hindi in parallel with English and many East Asian languages.
- FLORES dataset: English-Sinhala and English-Nepali corpora
- Uka Tarsadia University Corpus: 65k English-Gujarati sentence pairs. Corpus is described in this paper
- NLPC-UoM English-Tamil Corpus: 9k sentences, 24k glossary terms
- Wikititles: from statmt
- JNU-BHLTR Bhojpuri Corpus: English-Bhojpuri corpus of 65k sentences
- EILMT Corpus
- QED Corpus: English-Hindi corpus of 43k sentences from the educational domain.
- WikiMatrix Corpus: Mined from Wikipedia, looks noisy.
- CCMatrix: Parallel corpus mined from CommonCrawl, looks noisy (statmt repo).
- CGNetSwara: Hindi-Gondi parallel corpus (19k sentence pairs)
- MTEnglish2Odia: English-Odia (42k pairs)
- SAP Software Documentation: test and evaluation set for English-Hindi in the software documentation domain [paper]
- BUET English-Bangla Corpus, EMNLP-2020: 2.7M sentences (has overlaps with OPUS)
- CLE Parallel Corpus: Parallel corpus for English, Urdu and Nepali.
- Itihasa Parallel Corpus: 93k parallel sentences between English and Sanskrit from the Ramayana and Mahabharata.
- Ema-lon Manipuri Corpus: The first comparable corpus built for the Manipuri (mni)-English (eng) language pair, with parallel data comprising 124,975 Manipuri-English aligned sentences.
- PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in this paper.
- IIIT-H en-hi-codemixed-corpus: A gold standard parallel corpus consisting of 6096 English-Hindi code-mixed sentences containing a total of 63,913 tokens and monolingual English. Described in this paper.
- CALCS 2021 Eng-Hinglish dataset: Eng-Hinglish parallel corpus containing 10k pairs of sentences. Described in this paper.
- Kangri_corpus: The corpus contains 27,362 Hindi-Kangri parallel sentences. Described in [this paper](https://arxiv.org/abs/2103.11596).
- NLLB-Seed: Small human-translated parallel corpora from Wikipedia articles for very low resource languages. Includes 5 Indian languages: Kashmiri, Manipuri, Maithili, Bhojpuri, Chhattisgarhi.
- NLLB-MD: NLLB Multi Domain is a set of professionally-translated sentences in News, Unscripted informal speech, and Health domains. Covers Bhojpuri amongst Indian languages.
- NLLB-Mined: All the parallel corpora mined by the NLLB project. This repository was reconstructed by AllenAI based on metadata released by the NLLB Project.
- Sanskrit-Hindi-MT: Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning. Contains Sanskrit-English parallel data and Sanskrit-Hindi parallel (test) data.
- Nepali National corpus: The English-Nepali Parallel Corpus consists of a small set of data aligned at the sentence level with 27,060 English words and 21,756 Nepali words and a larger set of texts at the document level with 617,340 English words and 596,571 Nepali words. An additional set of monolingual data is also provided with 386,879 words in Nepali. Described here.
- Kathmandu University-English–Nepali Parallel Corpus: A parallel corpus of size 1.8 million sentence pairs for a low resource language pair Nepali–English. Described in this paper.
- CCAligned: A Massive Collection of more than 100 million cross-lingual web-document pairs in 137 languages aligned with English.
- CoPara: Long-context parallel corpora for 4 Dravidian languages. Contains 2586 passage pairs mined from New India Samachar [paper]
MT Evaluation
- WMT23 QE task: QE datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, te) with DA annotations. The references are also available, so these can also be used for reference-based metrics. For Marathi, post-edits and word-level error annotations are also available. 26k training sentences for Marathi, 7k for the others. (report)
- AI4Bharat IndicMT-Eval: MT evaluation datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, ml) with Multidimensional Quality Metric (MQM) annotations. 1400 sentence annotations per language (200 sentences and outputs from 7 MT systems).
Parallel Transliteration Corpus
- Dakshina Dataset: The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. Contains an aggregate of around 300k word pairs and 120k sentence pairs.
- BrahmiNet Corpus: 110 language pairs mined from ILCI parallel corpus.
- Xlit-Crowd: Hindi-English Transliteration Corpus created via crowdsourcing.
- Xlit-IITB-Par: Hindi-English Transliteration Corpus mined from parallel translation corpora.
- FIRE 2013 Track on Transliterated Search: Transliteration dataset of native words in Hindi, Bengali and Gujarati.
- NEWS 2018 Shared Task dataset: Transliteration datasets for Kannada, Tamil, Bengali and Hindi created by Microsoft Research India.
- AI4Bharat StoryWeaver Xlit Dataset - Transliteration datasets for Hindi, Maithili & Konkani
- Hindi WikiData Transliteration Pairs - Hindi dataset (90k pairs)
- NotAI-tech English-Telugu: Around 38k word pairs
- AI4Bharat Aksharantar: The largest publicly available transliteration dataset for 21 Indic languages consisting of 26M Indic language-English transliteration pairs. Described in this paper.
Text Classification
- BBC news articles classification dataset: 14 class classification
- iNLTK News Headlines classification: Datasets for multiple Indian languages.
- AI4Bharat IndicNLP News Articles: Word embeddings for 10 Indian languages.
- KMI Linguistics TRAC - 1: Contains aggression-annotated dataset (in English and Hindi) for the Shared Task on Aggression Identification during First Workshop on Trolling, Aggression and Cyberbullying (TRAC - 1) at COLING - 2018.
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning in 11 languages (includes Tamil). Described in this paper.
Textual Entailment/Natural Language Inference
- XNLI corpus: Hindi and Urdu test sets and machine translated training sets (from English MultiNLI).
- csebuetnlp Bangla NLI: A Natural Language Inference (NLI) dataset for Bengali. Described in this paper.
Paraphrase
- Amrita University-DPIL Corpus: Sentence level paraphrase identification for four Indian languages (Tamil, Malayalam, Hindi and Punjabi).
Sentiment, Sarcasm, Emotion Analysis
- IIT Bombay movie review datasets for Hindi and Marathi
- IIT Patna movie review datasets for Hindi
- IIIT-H LTRC Multi-domain dataset for Telugu
- ACTSA corpus for Telugu
- BHAAV (भाव) Corpus: A Text Corpus for Emotion Analysis from Hindi Stories
- SentiWordNet - SAIL - Hindi, Bangla, Tamil & Telugu
- Dravidian-CodeMix - FIRE 2020 - Tamil & Malayalam
- Bengali Sentiment Analysis - Classification Benchmark, 2020: 8k sentences
- SentNoB: sentiment dataset for Bangla from 3 domains on user comments containing 15k examples (Paper) (Dataset)
- UoM-Sinhala Sentiment Analysis: Sentiment Analysis for Sinhala Language. Consists of a multi-class annotated data set with 15059 sentiment annotated Sinhala news comments extracted from two Sinhala online news papers with four sentiment categories namely POSITIVE, NEGATIVE, NEUTRAL and CONFLICT and a corpus of 9.48 million tokens. Described in this paper.
Hate Speech and Offensive Comments
- Hate Speech and Offensive Content Identification in Indo-European Languages: (HASOC FIRE-2020)
- An Indian Language Social Media Collection for Hate and Offensive Speech, 2020: Hinglish Tweets and FB Comments collected during Parliamentary Election 2019 of India (Dataset available on request)
- Aggression-annotated Corpus of Hindi-English Code-mixed Data, 2018: Scraped from Facebook (21k) & Twitter (18k) (Paper)
- Did You Offend Me? Classification of Offensive Tweets in Hinglish Language, 2018: 3k tweets (Paper)
- A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection, 2018: 4.5k Tweets (Paper)
- Roman Urdu Offensive Language Detection, 2020: 10k tweets, can also be used for Hindi (Paper)
- Bengali Hate Speech - Classification Benchmark, 2020: 1.5k sentences
- Offensive Language Identification in Dravidian Languages, EACL 2021: Tamil, Malayalam, Kannada
- Fear Speech in Indian WhatsApp Groups, 2021
- HateCheckHIn: An evaluation dataset for Hindi Hate Speech Detection Models having a total of 34 functionalities out of which 28 functionalities are monolingual and the remaining 6 are multilingual. Hindi is used as the base language. Described in this paper.
Question Answering
- Facebook Multilingual QA datasets: Contains dev and test sets for Hindi.
- TyDi QA datasets: QA dataset for Bengali and Telugu.
- bAbi 1.2 dataset: Has Hindi version of bAbi tasks in romanized Hindi.
- MMQA dataset: Hindi QA dataset described in this paper
- XQuAD: testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in this paper
- XQA: testset for Tamil QA. Described in this paper
- HindiRC: A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in this paper
- IITH HiDG: A Distractor Generation Dataset for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in this paper
- Chaii: A Kaggle challenge which consists of 1104 questions in Hindi and Tamil. Moreover, here is a good collection of papers on multilingual Question Answering.
- csebuetnlp Bangla QA: A Question Answering (QA) dataset for Bengali. Described in this paper.
- XOR QA: A large-scale cross-lingual open-retrieval QA dataset (includes Bengali and Telugu) with 40k newly annotated open-retrieval questions that cover seven typologically diverse languages. Described in this paper. More information is available here.
- IITB HiQuAD: A question answering dataset in Hindi consisting of 6555 question-answer pairs. Described in this paper.
Dialog
- a-mma Indic Casual Dialogs Datasets
- A Code-Mixed Medical Task-Oriented Dialog Dataset: The dataset contains 3005 Telugu–English Code-Mixed dialogs with 29k utterances covering ten specializations with an average code-mixing index (CMI) of 33.3%. Described in this paper.
Discourse
Information Extraction
- EventXtract-IL: Event extraction for Tamil and Hindi. Described in this paper.
- [EDNIL-FIRE2020](https://ednilfire.github.io/ednil/2020/index.html): Event extraction for Tamil, Hindi, Bengali, Marathi, English. Described in this paper.
- Amazon MASSIVE: A Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation containing one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. Described in this paper.
- Facebook - MTOP Benchmark: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark with a dataset comprising 100k annotated utterances in 6 languages (including Indic language: Hindi) across 11 domains. Described in this paper.
POS Tagged corpus
- Indian Language Corpora Initiative
- Universal Dependencies
- IIITH Paninian Treebank: POS annotations for hi, bn, kn, ml and mr.
- Code Mixed Dataset for Hindi, Bengali and Telugu, ICON 2016 shared task
- JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 5000 sentences.
- KMI Magahi Corpus
- KMI Awadhi Corpus
- Tham Khasi Corpus: An annotated Khasi POS tagged corpus containing 83,312 words, 4,386 sentences, 5,465 word types which amounts to 94,651 tokens (including punctuations).
Chunk Corpus
- Indian Language Corpora Initiative
- Indian Languages Treebanking Project: Chunk annotations for hi, bn, kn, ml and mr.
Dependency Parse Corpus
- IIIT Hyderabad Hindi Treebank
- Universal Dependencies
- Universal Dependencies Hindi Treebank
- Universal Dependencies Urdu Treebank
- IIITH Paninian Treebank: Paninian Grammar Framework annotations along with mappings to Stanford dependency annotations for hi, bn, kn, ml and mr.
- Vedic Sanskrit Treebank: 4k Sanskrit dependency treebank [paper]
Coreference Corpus
Summarization
- XL-Sum: A Large-Scale Multilingual Abstractive Summarization dataset for 44 languages, comprising 1 million professionally annotated article-summary pairs from BBC. Spans 150k examples across 10 Indic languages. Described in this paper.
- TeSum: Telugu Abstractive Summarization dataset containing 20k+ article-summary pairs, with the summaries being manually created. [paper]
- WikiLingua: Cross-lingual summarization dataset created from WikiHow. Contains 9k English-Hindi article-summary pairs. [paper]
- MassiveSum: A large summarization dataset covering 13 Indian languages with ~1.9 million article-summary pairs. The summaries are mined from article metadata. [paper]
Data to Text
- XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages, comprising a high-quality XF2T dataset in 7 languages (Hindi, Marathi, Gujarati, Telugu, Tamil, Kannada, Bengali) and a monolingual dataset in English. The dataset is available upon request. Described in this paper.
Models
Language Identification
- NLLB-200: LID for 200 languages including 27 Indic languages.
Word Embeddings
- AI4Bharat IndicFT: Fast-text word embeddings for 11 Indian languages.
- FastText CommonCrawl+Wikipedia
- FastText Wikipedia
- Polyglot
- EM-FT: The first FastText word embedding available for Manipuri language trained on 1,880,035 Manipuri sentences.
- Sanskrit-Hindi-MT: The FastText embeddings for Sanskrit is available here and for Hindi here.
- UoM-Sinhala Sentiment Analysis- FastText 300: The FastText word embedding model for Sinhala language. Described in this paper.
Pre-trained Language Models
- AI4Bharat IndicBERT: Multilingual ALBERT based embeddings spanning 12 languages for Natural Language Understanding (including Indian English).
- AI4Bharat IndicBART: A multilingual, sequence-to-sequence pre-trained model based on the mBART architecture focusing on 11 Indic languages and English for Natural Language Generation of Indic Languages. Described in this paper.
- MuRIL: Multilingual mBERT based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding (paper).
- BERT Multilingual: BERT model trained on Wikipedias of many languages (including major Indic languages).
- mBART50: seq2seq pre-trained model trained on CommonCrawl of many languages (including major Indic languages).
- BLOOM: GPT-3-like multilingual transformer-decoder language model (includes major Indic languages).
- iNLTK: ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles.
- albert-base-sanskrit: ALBERT-based model trained on Sanskrit Wikipedia.
- RoBERTa-hindi-guj-san: Multilingual RoBERTa like model trained on Hindi, Sanskrit and Gujarati.
- Bangla-BERT-Base: Bengali BERT model trained on Bengali Wikipedia and OSCAR datasets.
- BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. Described in this paper.
- EM-ALBERT: The first ALBERT model available for Manipuri language which is trained on 1,034,715 Manipuri sentences.
- LaBSE: Encoder models suitable for sentence retrieval tasks supporting 109 languages (including all major Indic languages) [paper].
- LASER3: Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 27 Indic languages).
Multilingual Word Embeddings
Morphanalyzers
- AI4Bharat IndicNLP Project: Unsupervised morphanalyzers for 10 Indian languages learnt using morfessor.
Translation Models
- IndicTrans: Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian languages as well. A total of 110 translation directions are supported.
- Shata-Anuvaadak: SMT for 110 language pairs (all pairs between English and 10 Indian languages).
- LTRC Vanee: Dependency based Statistical MT system from English to Hindi.
- NLLB-200: Models for 200 languages including 27 Indic languages.
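Pair counts like the ones quoted above come from counting ordered source-target pairs: n languages give n × (n − 1) translation directions. The sketch below checks this for English plus 10 Indian languages, which yields the 110 pairs quoted for Shata-Anuvaadak. The language codes are illustrative placeholders, not an authoritative list for any system, and `count_directions` is a hypothetical helper.

```python
from itertools import permutations


def count_directions(num_languages: int) -> int:
    """Number of ordered (source, target) pairs among `num_languages` languages."""
    return num_languages * (num_languages - 1)


# Illustrative codes only: English plus 10 Indian languages (assumed for this example).
languages = ["en", "hi", "ur", "pa", "bn", "gu", "mr", "kok", "ta", "te", "ml"]

# Enumerate every ordered pair explicitly and compare with the closed form.
directed_pairs = list(permutations(languages, 2))
assert len(directed_pairs) == count_directions(len(languages))
print(len(directed_pairs))  # 11 * 10 = 110
```

Each unordered pair is counted twice here, once per direction, because translation quality and models generally differ between the two directions.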
Transliteration Models
- AI4Bharat IndicXlit: A transformer-based multilingual transliteration model with 11M parameters for Roman to native script conversion and vice versa that supports 21 Indic languages. Described in this paper.
Speech Models
- AI4Bharat IndicWav2Vec: Multilingual pre-trained models for 40 Indian languages based on Wav2Vec 2.0.
- Vakyansh CLSRIL-23: Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages (documentation) (experimentation platform).
- arijitx/wav2vec2-large-xlsr-bengali: Pretrained wav2vec2-large-xlsr trained on ~50 hrs (40,000 utterances) of OpenSLR Bengali data. Test WER 32.45% without LM.
NER
- AI4Bharat IndicNER: NER model for 11 Indic languages.
- AsNER: A Baseline Assamese NER model.
- L3Cube-MahaNER-BERT: A 752 million token multilingual BERT model. Described in this paper.
- CFILT HiNER: Hindi NER models trained on CFILT HiNER dataset. Described in this paper.
Speech Corpora
- Microsoft Speech Corpus: Speech corpus for Telugu, Tamil and Gujarati.
- Microsoft-IITB Marathi Speech Corpus: 109 hours of speech data collected via crowdsourcing.
- AccentDB: Database of Indian English accents from native speakers in Bangla, Malayalam, Telugu and Oriya.
- IIT Madras TTS database
- BABEL Speech Corpus: includes some Indian languages
- WikiPron: Words and their pronunciations in IPA mined from Wiktionary. Includes Indian languages. paper
- CVIT IndicSpeech: TTS data for 3 Indian languages: Malayalam, Bengali and Hindi (24 hours each).
- Google Speech Corpus: TTS data for 6 Indian languages: Malayalam, Marathi, Telugu, Kannada, Gujarati, Tamil (up to 9 hours each). Resources SLR#63-#66, #78-#79. (paper)
- CoVoST 2: Tamil 2 hrs data
- SMC Malayalam Speech Corpus - Download link
- Vāksañcayaḥ Sanskrit Speech Corpus: 78 hours of speech corpus in Sanskrit prose, with speaker-disjoint train, dev and test splits. It also contains additional out-of-domain test data with speakers having pronunciation influences from L1 (paper).
- IISc-MILE Kannada ASR Corpus: Transcribed speech corpus containing ~350 hours of read speech data for training ASR systems for Kannada language. Described in this paper.
- IISc-MILE Tamil ASR Corpus: Transcribed speech corpus containing ~150 hours of read speech data for training ASR systems for Tamil language. Described in this paper.
- MUCS 2021 Dataset: (Gujarati, Hindi, Marathi, Odia, Tamil, Telugu) Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages
- Gramvaani: 100 hours of labelled data and 1000 hours of pretraining data for Hindi
- Kashmiri Data Corpus: Collection of transcribed Kashmiri recordings taken from native speakers
- Hindi-Tamil-English ASR Challenge: 490 hours of transcribed speech data in three Indian Languages
- Large Sinhala ASR training data set: Sinhala ASR training data set containing ~185K utterances
- Large Bengali ASR training data set: Bengali ASR training data set containing ~196K utterances
- Large Nepali ASR training data set: Nepali ASR training data set containing ~157K utterances
- Crowdsourced high-quality Gujarati multi-speaker speech data set: Contains recordings of native speakers of Gujarati
- Crowdsourced high-quality Kannada multi-speaker speech data set: Contains recordings of native speakers of Kannada
- Crowdsourced high-quality Malayalam multi-speaker speech data set: Contains recordings of native speakers of Malayalam
- Crowdsourced high-quality Marathi multi-speaker speech data set: Contains recordings of native speakers of Marathi
- Crowdsourced high-quality Tamil multi-speaker speech data set: Contains recordings of native speakers of Tamil
- Crowdsourced high-quality Telugu multi-speaker speech data set: Contains recordings of native speakers of Telugu
- Nepali National corpus: The Nepali Spoken Corpus contains audio recordings from 17 different types of social activities with a total recording duration of 31 hours and 26 minutes. Described here.
- Shrutilipi: Over 6400 hours of transcribed speech corpus across 12 Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu
OCR Corpora
Multimodal Corpora
- English-Hindi Visual Genome: Images captioned in both English and Hindi.
- English-Hindi Flickr 8k: A subset of images from Flickr8k images captioned by native speakers in both English and Hindi. Code and data available here.
Language Specific Catalogs
Pointers to language-specific NLP resource catalogs