The Linguistic Data Consortium for Indian Languages Launches New Datasets to Boost AI Research

The Linguistic Data Consortium for Indian Languages (LDC-IL) recently held its 8th Project Advisory Committee meeting, where it announced the release of 16 new datasets in Indian languages. The datasets are intended to support research in Artificial Intelligence (AI) and Machine Learning (ML) and are expected to help build Indian-language technologies such as automatic speech recognition and live voice translation.

The datasets cover a wide range of Indian languages, including Hindi, Bengali, Tamil, Marathi, Kannada, Malayalam, Odia, Assamese, Konkani, Maithili, Urdu, and Nepali. The LDC-IL also introduced two datasets for Chhattisgarhi, a mother tongue often associated with Hindi. This move reflects the government’s commitment to promoting education and technology in all mother tongues of India, as recommended in the National Education Policy (NEP) 2020.

By offering these datasets, the LDC-IL aims to enhance research and development in Indian languages, benefiting both academia and industry. Applications built on these datasets are also expected to contribute to the preservation and promotion of the languages they cover.

The LDC-IL Data Distribution Portal now hosts a total of 57 datasets covering 21 Indian languages. What sets these datasets apart is that they are not crowdsourced; they are collected from verified sources and curated by language experts. This curation is meant to ensure their authenticity and reliability, making them valuable resources for training AI and ML models.

The launch of these new datasets marks a significant step towards advancing AI research in Indian languages. By providing researchers and developers with expert-curated resources, the LDC-IL aims to foster innovation and contribute to the growth of AI and ML technologies in India.