Audio LLM Researcher
Research and build audio understanding models and realtime audio LLM systems for natural, low-latency, human-like conversation.
About the Role
- You will work on next-generation audio understanding models that power natural spoken interaction, including speech understanding, paralinguistic understanding, conversational reasoning, and multimodal alignment.
- This role sits close to both research and product. We care about frontier model quality, but we also care about whether those models can survive real conversational conditions in production.
- We especially welcome candidates with hands-on experience building end-to-end audio LLM systems or full-duplex realtime interaction systems.
Responsibilities
- Research and develop audio understanding models for tasks such as speech semantics, acoustic event understanding, dialogue state tracking, speaker behavior understanding, and emotion-aware interaction.
- Design, train, and evaluate end-to-end audio LLM systems that take raw or lightly processed audio as input and produce robust conversational outputs.
- Optimize models for realtime deployment, including streaming inference, latency control, interruption and barge-in handling, and full-duplex conversational behavior.
- Build evaluation protocols and error taxonomies for challenging conversational scenarios, then turn findings into concrete data and model improvements.
- Work with product and engineering to translate model advances into user-facing conversational quality gains.
Requirements
- Strong foundation in machine learning and deep learning, with hands-on experience training and analyzing large neural models.
- Solid coding skills in Python and strong familiarity with PyTorch and modern research tooling.
- Experience in one or more of the following areas: speech recognition, spoken language understanding, audio-language modeling, multimodal learning, dialogue systems, or realtime interaction systems.
- Ability to design rigorous experiments, interpret failure modes clearly, and iterate quickly from data to model to evaluation.
- Strong communication and collaboration skills, with the ability to work across research, product, and infrastructure teams.
Preferred Qualifications
- Hands-on experience with end-to-end audio LLMs, speech-language models, or other architectures that directly model audio-to-text or audio-to-action behavior.
- Experience building or researching full-duplex spoken agents, realtime voice assistants, or low-latency conversational systems.
- Experience building audio datasets, running data pipelines, or designing annotation standards and quality-control workflows for speech or conversational data.
- Public work in relevant areas, such as published papers, open-source releases, or technical reports, is a strong plus.
- Experience with multilingual or cross-domain audio understanding is a plus.
How to Apply
- Send your resume together with selected papers, technical reports, demos, or repository links that best represent your work.
- If you have worked on audio data pipelines or annotation systems, include a short note on the scale, quality bar, and your contribution.
- For projects involving realtime systems, conversational agents, or audio LLMs, concrete latency and product metrics are highly valued.
- Take-home tasks, if any, will be paid at a reasonable market rate.