Audio LLM Researcher

Research and build audio understanding models and realtime audio LLM systems for natural, low-latency, human-like conversation.

About the Role

  • You will work on next-generation audio understanding models that power natural spoken interaction, including speech understanding, paralinguistic understanding, conversational reasoning, and multimodal alignment.
  • This role sits close to both research and product. We care about frontier model quality, but we also care about whether those models can survive real conversational conditions in production.
  • Candidates with hands-on experience in end-to-end audio LLM systems or full-duplex realtime interaction are especially encouraged to apply.

Responsibilities

  • Research and develop audio understanding models for tasks such as speech semantics, acoustic event understanding, dialogue state tracking, speaker behavior understanding, and emotion-aware interaction.
  • Design, train, and evaluate end-to-end audio LLM systems that take raw or lightly processed audio as input and produce robust conversational outputs.
  • Improve models for realtime deployment, including streaming inference, latency control, interruption handling, barge-in, and full-duplex conversational behavior.
  • Build evaluation protocols and error taxonomies for challenging conversational scenarios, then turn findings into concrete data and model improvements.
  • Work with product and engineering to translate model advances into user-facing conversational quality gains.

Requirements

  • Strong foundation in machine learning and deep learning, with hands-on experience training and analyzing large neural models.
  • Solid coding skills in Python and strong familiarity with PyTorch and modern research tooling.
  • Experience in one or more of the following areas: speech recognition, spoken language understanding, audio-language modeling, multimodal learning, dialogue systems, or realtime interaction systems.
  • Ability to design rigorous experiments, interpret failure modes clearly, and iterate quickly from data to model to evaluation.
  • Strong communication and collaboration skills, with the ability to work across research, product, and infrastructure teams.

Preferred Qualifications

  • Hands-on experience with end-to-end audio LLMs, speech-language models, or other architectures that directly model audio-to-text or audio-to-action behavior.
  • Experience building or researching full-duplex spoken agents, realtime voice assistants, or low-latency conversational systems.
  • Experience building audio datasets, running data pipelines, or designing annotation standards and quality-control workflows for speech or conversational data.
  • Public work in relevant areas, such as published papers, open-source releases, or technical reports, is strongly preferred.
  • Experience with multilingual or cross-domain audio understanding is a plus.

How to Apply

  • Send your resume together with selected papers, technical reports, demos, or repository links that best represent your work.
  • If you have worked on audio data pipelines or annotation systems, include a short note on the scale, quality bar, and your contribution.
  • For projects involving realtime systems, conversational agents, or audio LLMs, please include concrete latency and product metrics; they are highly valued.
  • Take-home tasks, if any, will be paid at a reasonable market rate.