Over the last decade or so, research in speech technologies has shifted rapidly and successfully towards exclusively data-driven techniques such as machine learning and deep learning. Experiments with well-resourced languages such as English have demonstrated that these systems succeed when sufficient training data is available. However, barring a handful of languages, this technological revolution has bypassed most of the languages spoken in India, including the officially supported, scheduled languages. This can be gauged from the limited commercial support for Indian languages across speech-based products: Amazon Alexa supports Hindi alongside seven other international languages; Google Home supports 13 languages, of which Hindi is the only Indian language; and Microsoft supports Indian English, Hindi, Tamil, Telugu, Gujarati, and Marathi in its ASR systems. There is no support whatsoever for most other Indian languages, especially those belonging to the Tibeto-Burman and Austro-Asiatic language families. One of the primary reasons could be the lack of sufficient speech datasets for most Indian languages. The shortage is even more acute for the non-scheduled Indo-Aryan and Dravidian languages, and even for the scheduled languages of the Tibeto-Burman and Austro-Asiatic families, which are largely spoken in the Eastern and North-Eastern parts of India. The Speech Datasets and Models for Indian Languages (SpeeD-IL) project aims to fill this gap by building large-scale, diverse transcribed speech datasets and models for Indian languages.