Computational Speech Processing | Language documentation | Phonology
About Me
I am a PhD student in Linguistics specializing in computational speech processing.
I am interested in how Automatic Speech Recognition (ASR), spoken Keyword Search (KWS), and related speech and language processing technologies can benefit linguists engaged in fieldwork and documentation of low-resource languages.
Research Interests
Automatic Speech Recognition (ASR) for low-resource languages
ASR for codemixed and bilingual audio
Computational methods for aiding language documentation
My research focuses on computational methods in language documentation, particularly adapting foundational ASR architectures to process fieldwork data.
I work with Tira, a Kordofanian language spoken in Sudan.
I am interested in how to improve automatic speech recognition on low-resourced languages, especially when code-mixed with high-resource languages, and how ASR methods can aid language documentation.
In particular, I am interested in applying ASR to bilingual audio.
Most fieldworkers interact with the language community they work with using a meta-language, such as English, French or Spanish, which results in a lot of fieldwork audio being bilingual or code-switched.
It's difficult enough to apply ASR to monolingual fieldwork data; bilingual fieldwork data is an extra challenge!
Another major hurdle for integrating ASR (or any NLP tool) into linguistic fieldwork is the availability of consistent data.
Fieldworkers necessarily work on languages that do not have pre-existing standards for lexicography, orthography, or even phonetic transcription.
Furthermore, morphologically rich languages (like Tira!) face a sparsity problem since many inflected forms of a word may be missing from a given dataset.
To overcome these challenges, I am currently looking into how spoken keyword search (KWS) can be applied to speed up annotation of linguistic fieldwork audio without requiring any training or fine-tuning.
I am also interested in exploring how morphological parsers can bootstrap high-quality datasets for training NLP algorithms on a fieldwork language with minimal manual annotation, a strategy that takes advantage of the grammatical insight a linguist can provide.
Publications
2025
Simmons, Mark and Patience Epps. 2025. Tonogenesis in the Naduhup family of northwest Amazonia. Diachronica. Forthcoming.
Simmons, Mark. 2025. Data augmentation for low-resource bilingual ASR from Tira linguistic elicitation using Whisper. Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages, Honolulu, Hawai'i.
Kaldhol, Nina, Sharon Rose and Mark Simmons. 2025. Prosody of topic and focus in Tira. Cross-disciplinary approaches to Information Structure in Niger-Congo languages. (Contemporary African Linguistics). Berlin: Language Science Press.
2022
Wei-Jen Ko, Cutter Dalton, Mark Simmons, Eliza Fisher, Greg Durrett and Junyi Jessy Li. Discourse Comprehension: A Question Answering Framework to Represent Sentence Connections. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
Presentations and talks
Simmons, Mark. Morphologically constrained metaphony in Tira. A talk presented at the Phonetics and Phonology in Europe 2023 Satellite Workshop 'Metaphony', June 1, 2023.
Rose, Sharon and Simmons, Mark. Focus, topic and prosody in Tira. A talk presented at the 6th African Linguistics School in Porto Novo, Bénin.
Simmons, Mark. Reconstructing word-final voicing in Nadëb. A poster presented at the Annual Meeting of the Linguistic Society of America, January 6-9, 2022.
Teaching experience
Guest lecture on ASR for LIGN 167.
Teaching assistant for LIGN 167: Deep Learning for Natural Language Understanding at UCSD, Fall 2025.
Teaching assistant for LIGN 8: Languages of America at UCSD, Summer 2025.
Teaching assistant for LIGN 168: Computational speech processing at UCSD, Spring 2024.
Teaching assistant for LIGN 8: Languages of America at UCSD, Winter 2024.
Teaching assistant for LIGN 110: Phonetics at UCSD, Fall 2023.
Teaching assistant for LIGN 8: Languages of America at UCSD, Spring 2023.
Guest lecture on vowel harmony in Tira for LIGN 111.
Teaching assistant for LIGN 111: Phonology at UCSD, Winter 2023.
Teaching assistant for LIGN 110: Phonetics at UCSD, Fall 2022.
Languages and skills
Spanish
Portuguese
Python
LaTeX
Awards and grants
Brython-Davis Fellowship. Spring 2024 and Fall 2024
Cota Robles Fellowship. 2021-2022 and 2024-2025
George H. Mitchell Award. May 2021
Work experience
Undergraduate research assistant: Naduhup documentation team. Documentation and description of the Nadëb language (Naduhup, Brazil) under Prof. Patience Epps, UT Austin, 2018-2021.
Undergraduate research assistant: Discourse Question Answering framework. Data annotation for discourse question answering research team under Prof. Jessy Li, UT Austin, Summer 2021
Graduate research assistant: Tira language documentation. Transcribing fieldwork interviews and ingesting morphological data for the Tira language project. Summer 2022.
Graduate research assistant: Richard Montague archiving project. Transcribed audio interviews, using ASR and speaker diarization to speed up transcription. Summer 2024.