Language Resources and Tools for Turkic Languages
Turkish Natural Language Processing Tools:
ITU Turkish Natural Language Processing Pipeline prepared by ITU NLP Group:
http://tools.nlp.itu.edu.tr/
Turkish Language Resources:
TURKISH TEXT DATASETS and TREEBANKS:
Prepared by ITU NLP Group:
Prepared by other researchers and groups:
- TS Corpus: The Turkish Corpus Project contains over 491 million tagged tokens, drawn from sources ranging from Turkish social media posts to idioms and proverbs.
- Turkish National Corpus (TNC): The TNC is a large-scale, general-purpose Turkish text corpus. The corpus comprises 50 million words of contemporary Turkish.
- Bilkent Turkish Writings Dataset: This dataset contains content from Turkish creative writing courses between 2014 and 2018. In total, nearly 7,000 texts are available for download in CSV format.
- Sentiment Lexicons for 81 Languages: This dataset contains both positive and negative sentiment dictionaries for 81 languages, including Turkish.
- Old Newspapers: This corpus contains natural language text from various newspapers, social media posts and blog pages in multiple languages. Overall, the corpus contains nearly 17 million sentences in 67 languages, including Turkish.
- English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset: These two datasets consist of automatically categorized and annotated sentences taken from the Turkish and English Wikipedia. They were created for named entity recognition and text categorization, respectively.
- Turkish UD: This is a work-in-progress overview of the UD annotation for Turkish.
- Turkish WordNet: You can reach Turkish WordNet from this link.
- trTenTen20: Corpus of the Turkish Web 2020: A Turkish corpus made up of texts collected from the Internet. The corpus contains 4.9 billion words.
- trWaC: Turkish corpus from the web: A Turkish corpus made up of texts collected from the Internet. The corpus consists of 32 million words.
- CHILDES: The CHILDES corpora are comparable corpora made up of transcripts of child language. The current CHILDES corpora cover 24 languages.
- ParlaMint 2.1: This dataset contains parliamentary debate transcripts for 17 different languages, including Turkish.
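The UD treebanks listed above are distributed in the CoNLL-U format, in which each token line carries ten tab-separated fields. The following is a minimal sketch of reading such data in Python, not an official loader. The sample sentence is hand-written for illustration and is not taken from any treebank, and the names `parse_conllu`, `CONLLU_FIELDS` and `SAMPLE` are hypothetical.

```python
# The ten CoNLL-U columns, in order, as defined by Universal Dependencies.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)

# Hand-written illustrative sentence (an assumption, not real treebank data).
SAMPLE = "\n".join([
    "# text = Kedi uyudu.",
    "1\tKedi\tkedi\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_",
    "2\tuyudu\tuyu\tVERB\t_\tNumber=Sing|Person=3|Tense=Past\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token["form"], token["upos"], token["head"], token["deprel"])
```

For real treebank files, a maintained CoNLL-U reader is likely the better choice, since this sketch does not treat multiword-token ranges or empty nodes specially.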
TURKISH LANGUAGE MODELS:
- ELMo: You can reach the Turkish pre-trained ELMo language model from this link.
- BERT: You can reach the Turkish pre-trained BERT language model from this link.
- Fasttext: You can reach the Turkish pre-trained Fasttext language model from this link.
TURKISH PARALLEL CORPORA:
- The English-Swedish-Turkish Corpus: This corpus consists of original texts and their translations from Turkish to Swedish and English. It is organized so that texts, paragraphs, sentences, and words are aligned with each other.
- Bianet Corpus: This corpus contains over 3,000 Turkish articles with their sentence-aligned Kurdish or English translations. All content comes from the Bianet online newspaper archives.
- OPUS Parallel Corpora: This set of text corpora contains aligned sentences in 40 languages, including Turkish. As a result, users can check translation sentence pairs for many languages.
- OpenSubtitles Parallel Corpora 2018: This dataset is a parallel corpus collection of translated movie subtitles in 58 different languages, including Turkish.
TURKISH SPEECH DATASETS:
Azerbaijani Language Resources:
- en-az-parallel-corpus: This directory contains information about a parallel corpus for English-Azerbaijani and Azerbaijani-English translation tasks.
- az-corpus-nlp: Dataset materials for NLP in the Azerbaijani language.
- azWaC: Azerbaijani corpus from the web: The Azerbaijani Web Corpus (azWaC) is an Azerbaijani corpus made up of texts collected from the Internet. The total size of the corpus is 94 million words.
- University of Leipzig corpus collection: This collection provides comparable corpora in different languages, including Azerbaijani, using the same format. The Azerbaijani datasets consist of Newscrawl (2011, 2013) and Wikipedia (misc) data.
- Helsinki University corpus: New Testament in the Azerbaijani language.
- azwiki dump: The Azerbaijani-language Wikipedia dump can be downloaded directly.
- Azeri at An Crúbadán: This corpus includes 8M+ words in Latin script.
- Domrachyov-Sudoplatova scraped corpus: This corpus includes 2,189,398 words and 100,560 sentences in the Azerbaijani language.
- AZ summarization: This corpus consists of articles and titles and it is available on request.
- Awesome Azeri NLP: A curated list of Azerbaijani language processing software, along with some research papers describing language resources, can be found here.
AZERBAIJANI LANGUAGE MODELS:
Kazakh Natural Language Processing Tools:
https://github.com/makazhan/kaznlp
Kazakh Language Resources:
- Almaty Corpus of Kazakh language (NCKL): The corpus currently contains more than 40 million word tokens. The texts were annotated with an automatic morphological analyzer, which parsed 86% of the word forms in the corpus.
- Open Source Kazakh Language Corpus: This corpus was built for the Kazakh language from a Wikipedia dump. A total of 21 million words were collected, including almost 600 thousand words of different derivations.
- Kazakh UD Treebank: This is a work-in-progress overview of the UD annotation for Kazakh.
- kkWaC: Kazakh corpus from the web: The Kazakh Web Corpus (kkWaC) is a Kazakh corpus made up of texts collected from the Internet. The total size of the corpus is 139 million words.
Kyrgyz Language Resources:
- Kyrgyz corpus from the web: The Kyrgyz Web Corpus (kyWaC) is a Kyrgyz corpus made up of texts collected from the Internet. The total size of the corpus is 19 million words.
- Kyrgyz UD: Kyrgyz is among the upcoming UD languages.
Uzbek Language Resources:
- Uzbek Corpus: Uzbek community corpus based on material from 2017. It includes 663,119 sentences, 706,385 types and 9,256,001 tokens.
- uzWaC: Uzbek corpus from the web: The Uzbek Web Corpus (uzWaC) is an Uzbek corpus made up of texts collected from the Internet. The total size of the corpus is 18 million words.
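The sentence, type and token counts quoted for the Uzbek Corpus above are enough to derive two common summary statistics: the type/token ratio and the mean sentence length. The short sketch below uses only the figures from that entry; the function names are hypothetical helpers, not part of any corpus toolkit.

```python
# Figures quoted in the Uzbek Corpus entry above.
SENTENCES = 663_119
TYPES = 706_385
TOKENS = 9_256_001


def type_token_ratio(types, tokens):
    """Distinct word forms divided by running words (a rough lexical-diversity measure)."""
    return types / tokens


def mean_sentence_length(tokens, sentences):
    """Average number of tokens per sentence."""
    return tokens / sentences


ttr = type_token_ratio(TYPES, TOKENS)
avg_len = mean_sentence_length(TOKENS, SENTENCES)

print(f"type/token ratio: {ttr:.4f}")
print(f"mean sentence length: {avg_len:.2f} tokens")
```

With these counts the ratio comes out at roughly 0.076 and the mean sentence length at roughly 14 tokens. Type/token ratios depend strongly on corpus size, so the value should not be compared directly with corpora of very different sizes.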
Tatar Language Resources:
- Corpus of Written Tatar: This is a text corpus of the modern Tatar language consisting of over 500 million word occurrences (>620 million tokens).
- Tatar National Corpus: The corpus contains 180,000,000 tokens (as of December 2018). It includes texts of different styles and genres (fiction, media texts, official documents, educational and scientific literature, etc.).
- Tatar Belletristic Literature Corpus: The corpus includes prose and poetic works of Tatar authors, texts of particular folklore genres, as well as works translated from other languages into Tatar.
Turkmen Language Resources:
- tkWaC: Turkmen corpus from the web: The Turkmen Web Corpus (tkWaC) is a Turkmen corpus made up of texts collected from the Internet. The total size of the corpus is 2 million words.
Uyghur Language Resources:
- UyNeRel: Uyghur Named Entity Relation Corpus.
- Uyghur UD: This is a work-in-progress overview of the UD annotation for Uyghur.
Turkic Natural Language Processing Tools:
- Apertium Project: The project maintains a large number of morphological analysers for the Turkic languages. These analysers are used, for example, in the annotation of the Corpus of Written Tatar. Two small, unreleased treebanks of Tuvan and Crimean Tatar are also available along with the analysers.
Please let us know if you want to include your datasets.