Sumanth Doddapaneni

Hello | నమస్కారం | नमस्ते

I'm Sumanth Doddapaneni, an ML Researcher at Sarvam AI and a fourth-year Ph.D. student (currently on leave) in the Department of Computer Science at IIT Madras. I am advised by Mitesh M. Khapra and Anoop Kunchukuttan. My research focuses on multilingual language modeling, machine translation, and automatic speech recognition, with a strong emphasis on low-resource Indian languages. I’m fortunate to be a Google PhD Fellow (2023) and have received Outstanding Paper Awards at EMNLP 2024 and ACL 2024 for my work on auto evaluation and multilingual dataset creation.

Previously, I interned at Google Research in Bangalore, where I worked on improving multilingual generation with Nitish Gupta and Partha Talukdar. I also spent a summer at Google Research in Mountain View, focusing on language model personalization with Krishna Sayana. I collaborated with Rahul Aralikatte at Mila - Quebec AI Institute on multilingual summarization.

Some of my best contributions include: CIA, FBI, IndicTrans (v3, v2, and v1), IndicBERT, and IndicWav2Vec.

If you wanna chat about research/academia/whatever, feel free to reach out to sumanth@sarvam.ai

news

Jul, 2025	Attending ACL 2025, Vienna, Austria 🇦🇹 to present CIA work! Let's catch up if you are there!
June, 2025	Excited for the release of Sarvam-Translate, now supporting 22 Indian languages and structured long-form text. The model is openly available for everyone to use on HF and on Sarvam APIs!
May, 2025	CIA: Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs paper has been accepted at ACL 2025!
Apr, 2025	Joined Sarvam AI as an ML Researcher. Building the best models for India and beyond! 🚀
Mar, 2025	Happy to release the beta version of IndicTrans3. Try it out and let us know your feedback! 🚀
Mar, 2025	Attending Advanced Language Processing School (ALPS) 2025 at the Centre CNRS Paul Langevin, Aussois, France! 🇫🇷
Jan, 2025	Talk at Microsoft on "Rethinking Evaluator LLMs With a Cross-Lingual Twist". Thanks Gagan Madan for hosting me!
Jan, 2025	Talk at Google DeepMind, Bangalore on "Rethinking Evaluator LLMs With a Cross-Lingual Twist". Thanks Harman Singh for hosting me!
Jan, 2025	I'll be at the Google DeepMind Research Symposium on Jan 27 & 28, 2025. Let's catch up if you are there!
Nov, 2024	🏆 Delighted to share that our work FBI has received the Outstanding Paper Award at EMNLP 2024!
Nov, 2024	Talk at Google Research, Mountain View on "Rethinking Evaluator LLMs With a Cross-Lingual Twist". Thanks Krishna Sayana for hosting me!
Nov, 2024	Talk at Language Technologies Institute (LTI) @ Carnegie Mellon University (CMU) on "Rethinking Evaluator LLMs With a Cross-Lingual Twist". Thanks Simran Khanuja for hosting me!
Nov, 2024	Attending EMNLP 2024, Miami, USA 🇺🇸 to present FBI work! Let's catch up if you are there!
Oct, 2024	Talk at IT University of Copenhagen on "Rethinking Evaluator LLMs With a Cross-Lingual Twist". Thanks Ratish Puduppully for hosting me!
Oct, 2024	Our pre-print CIA: Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs is out on arXiv!
Sep, 2024	FBI paper has been accepted at EMNLP 2024!
Sep, 2024	After almost 3 years in review, our survey paper on multilingual language models (pre-GPT era) has been accepted to ACM Computing Surveys!
Aug, 2024	🏆 Delighted to share that our work IndicLLMSuite has received the Outstanding Paper Award at ACL 2024!
Aug, 2024	Attending ACL 2024, Bangkok, Thailand 🇹🇭! Let's catch up if you are there!
June, 2024	Our pre-print FBI: Finding Blindspots in LLM Evaluations with Interpretable Checklists is out on arXiv!
May, 2024	IndicLLMSuite has been accepted at ACL 2024!
Feb, 2024	I'll be at the Google Research Week, between Feb 1-3, 2024. Let's catch up if you are there!
Dec, 2023	Attending EMNLP 2023, Singapore 🇸🇬! Let's catch up if you are there!
Nov, 2023	My research is now funded by Google PhD Fellowship. Thank You, Google!
Nov, 2023	Will start as a Research Intern at Google Research, India 🇮🇳! Will be in Bangalore till March'24. Let's catch up if you are here!
Oct, 2023	Will be traveling along the East Coast during the last 2 weeks of October: Boston (Oct 22-24), New York (Oct 24-28), Pittsburgh (Oct 28-31), and back in the Bay Area till Nov 4. Come say Hi, and let's walk around the streets, eat good food and maybe talk NLP!
July, 2023	Will be attending ACL 2023 in Toronto, Canada 🇨🇦! I will be presenting IndicXTREME, Naamapadam, and Vārta. Let's catch up if you are here!
June, 2023	Started as a Student Researcher at Google Research, Mountain View 🇺🇸! Will be in the Bay Area till November. Let's catch up if you are here!
May, 2023	Released IndicTrans2, the first model to support all 22 Scheduled Indian languages. More details in the Paper. Kudos to the entire AI4Bharat Team for pulling off this herculean effort.
May, 2023	Released a pre-print "A Comprehensive Analysis of Adapter Efficiency". Paper available here. Kudos to Nandini Mundra for driving this work!
May, 2023	Three papers accepted at ACL 2023. Pre-prints - IndicXTREME, Vārta, Naamapadam
Apr, 2023	Our paper A Survey of Adversarial Defences and Robustness in NLP is accepted at ACM Computing Surveys.
Feb, 2023	Talk at Google Research India on Building Natural Language Understanding (NLU) capabilities for Indic languages. Thanks Nitish Gupta and Partha Talukdar for hosting me!
Feb, 2023	Our paper Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages is accepted at ICASSP 2023.
Jan, 2023	I'll be attending Google Research Week 2023! Let's catch up if you are there!
Dec, 2022	Released Naamapadam. Paper is available here.
Dec, 2022	Released IndicXTREME and IndicBERT v2. Paper is available here.
Dec, 2022	Attending EMNLP 2022, Abu Dhabi 🇦🇪! Let's catch up if you are there!
Nov, 2022	I'll be attending ALPS 2023! Let's catch up if you are there!
Sept, 2022	Released a pre-print of our paper Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages. Work led by Kaushal Bhogale.
May, 2022	Presenting Samanantar at ACL 2022, Dublin 🇮🇪 (Thank You, Prof. Mitesh Khapra). In-person talk (25/05, session 7) & Poster (25/05, session 6). Come say Hi, and let's talk NMT
Feb, 2022	Presenting IndicWav2Vec at AAAI 2022.
Feb, 2022	I'll be attending the Google Research Week 2022! Feel free to get in touch if you are attending the same.

selected papers

ACL'25

Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra

Abstract

Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments compared to proprietary models, demonstrating the effectiveness of such cross-lingual evaluation in low resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.
EMNLP'24

Finding Blindspots in LLM Evaluations with Interpretable Checklists

Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M. Khapra

Abstract | 🏆 Outstanding Paper

Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators in the literature. Our findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50% of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications. Code and data are available at https://github.com/AI4Bharat/FBI.
ACL'24

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad G, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra

Abstract | 🏆 Outstanding Paper

Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction-fine tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generate non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released with permissive licenses.
TMLR

IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

AI4Bharat, Jay Gala, Pranjal A. Chitale, Raghavan AK, Sumanth Doddapaneni, Varun Gumma, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan

Abstract

India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages, listed in the Constitution of India (referred to as scheduled languages), are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all the 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/ai4bharat/IndicTrans2.
ACL'23

Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar

Abstract

Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers, is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes: (i) monolingual corpora, (ii) NLU testsets, and (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at https://github.com/AI4Bharat/IndicBERT.
TACL

Samanantar: The Largest Publicly Available Parallel Corpora Collection For 11 Indic Languages

Gowtham Ramesh*, Sumanth Doddapaneni*, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra

Abstract

We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and additionally mine 37.4 million sentence pairs from the Web, resulting in a 4× increase. We mine the parallel sentences from the Web by combining many corpora, tools, and methods: (a) Web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validate the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at Samanantar and we hope they will help advance research in NMT and multilingual NLP for Indic languages.
AAAI'22

Towards Building ASR Systems For The Next Billion Users

Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

Abstract

Recent methods in speech and language technology pretrain very large models which are fine-tuned for specific tasks. However, the benefits of such large models are often limited to a few resource rich languages of the world. In this work, we make multiple contributions towards building ASR systems for low resource languages from the Indian subcontinent. First, we curate 17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains including education, news, technology, and finance. Second, using this raw speech data we pretrain several variants of wav2vec style models for 40 Indian languages. Third, we analyze the pretrained models to find key features: codebook vectors of similar sounding phonemes are shared across languages, representations across layers are discriminative of the language family, and attention heads often pay attention within small local windows. Fourth, we fine-tune this model for downstream ASR for 9 languages and obtain state-of-the-art results on 3 public datasets, including on very low-resource languages such as Sinhala and Nepali. Our work establishes that multilingual pretraining is an effective strategy for building ASR systems for the linguistically diverse speakers of the Indian subcontinent.

flags

Conferences and internships took me here 🇮🇪 🇦🇪 🇺🇸 🇨🇦 🇸🇬 🇹🇭 🇫🇷 🇦🇹 (so far)!