Corpus Creation & Collection
A corpus is a large-scale collection of linguistic data, compiled from written texts in various formats. Corpora are used for many purposes, ranging from learning language mechanics to developing computer technology.

Whether you are a university training students in various aspects of a language, a research scholar studying language mechanics and how contemporary language changes over time, or a software company building a text-to-speech or translation engine, Webdunia has a corpus that fits the bill.

Webdunia specializes in corpus data creation and customization. We hold a large collection of multilingual data sets in various Indian languages, which we can customize to each client's needs. Data can be delivered sentence by sentence or as topic-wise articles, as you prefer. Topics range from lighter subjects such as sports and recreation to more serious ones such as international diplomacy, and we can deliver them all.

Webdunia provides corpora of various types:
Word Corpus: Used to test the coverage of tools such as spell checkers and thesauri. Webdunia works with all Indian languages and thus has large amounts of data in all of them. This data is used to find unique words and provide a proofread word corpus.
Parallel Sentences Corpus: An important way of understanding language structure and testing machine translation is to compare an English sentence with the same sentence translated into another language. Webdunia possesses hundreds of thousands of parallel sentences in English and other languages, ranging from simple to complex and drawn from fields such as sports, international affairs, cinema and theatre, and literature.
Paragraphs: Another form of parallel corpus, built with a paragraph-to-paragraph translation approach. Individual sentences may not be equivalent across the two languages, but taken as a whole, the paragraphs in both languages contain the same information.
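
As a rough illustration of how a word corpus can be used to check a tool's coverage, the Python sketch below extracts unique words from a few sentences and measures what fraction of them a tool's word list recognizes. It is a minimal sketch under stated assumptions, not a description of Webdunia's process: the function names, the sample sentences, and the sample lexicon are all hypothetical, and tokenization is simplified to whitespace splitting with punctuation trimmed from each end.

```python
import unicodedata


def _is_punct(ch):
    # Unicode punctuation categories all start with "P" (this includes the Devanagari danda).
    return unicodedata.category(ch).startswith("P")


def tokenize(text):
    """Split on whitespace and trim punctuation from both ends of each token."""
    tokens = []
    for raw in text.split():
        start, end = 0, len(raw)
        while start < end and _is_punct(raw[start]):
            start += 1
        while end > start and _is_punct(raw[end - 1]):
            end -= 1
        if start < end:
            tokens.append(raw[start:end].lower())
    return tokens


def unique_words(sentences):
    """Collect the set of distinct words found in the given sentences."""
    vocab = set()
    for sentence in sentences:
        vocab.update(tokenize(sentence))
    return vocab


def coverage(corpus_words, tool_lexicon):
    """Fraction of corpus words that also appear in the tool's lexicon."""
    if not corpus_words:
        return 0.0
    return len(corpus_words & tool_lexicon) / len(corpus_words)


# Hypothetical sample data, for illustration only.
sample_sentences = ["The match was exciting.", "The team won the match!"]
sample_lexicon = {"the", "match", "was", "team"}

words = unique_words(sample_sentences)
print(sorted(words))
print(f"coverage: {coverage(words, sample_lexicon):.2f}")
```

Words missing from the lexicon (the set difference `words - sample_lexicon`) are the ones a reviewer would inspect first. Whitespace splitting is used instead of a `\w+` pattern because that pattern can break words in scripts that rely on combining vowel signs.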