indicnlp_catalog

A collaborative catalog of NLP resources for Indic languages


:bookmark: The Indic NLP Catalog

A Collaborative Catalog of Resources for Indic Language NLP

The Indic NLP Catalog repository is an attempt to collaboratively build the most comprehensive catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent.

Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:

[Wikipedia Dumps](https://dumps.wikimedia.org/)

Add a short, informative description of the dataset and provide links to any paper, article, or site documenting the resource. Mention your name too: we would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.

Indian language NLP has come a long way. We feature a few resources that illustrate recent trends along various axes and point to a bright future.

Browse the entire catalog…

:raising_hand: Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.

Major Indic Language NLP Repositories

Libraries and Tools

Evaluation Benchmarks

Benchmarks spanning multiple tasks.

Standards

Text Corpora

Monolingual Corpus

Language Identification

Lexical Resources and Semantic Similarity

NER Corpora

Parallel Translation Corpus

MT Evaluation

Parallel Transliteration Corpus

Text Classification

Textual Entailment/Natural Language Inference

Paraphrase

Sentiment, Sarcasm, Emotion Analysis

Hate Speech and Offensive Comments

Question Answering

Dialog

Discourse

Information Extraction

POS Tagged corpus

Chunk Corpus

Dependency Parse Corpus

Coreference Corpus

Summarization

Data to Text

Models

Language Identification

Word Embeddings

Pre-trained Language Models

Multilingual Word Embeddings

Morphanalyzers

Translation Models

Transliteration Models

Speech Models

NER

Speech Corpora

OCR Corpora

Multimodal Corpora

Language Specific Catalogs

Pointers to language-specific NLP resource catalogs