Audio LLM Researcher

Research and build audio understanding models and realtime audio LLM systems for natural, low-latency, human-like conversation.

About the Role

  • You will work on next-generation audio understanding models that power natural spoken interaction, including speech understanding, paralinguistic understanding, conversational reasoning, and multimodal alignment.
  • This role sits close to both research and product. We care about frontier model quality, but we also care about whether those models can survive real conversational conditions in production.
  • Candidates with hands-on experience in end-to-end audio LLM systems or full-duplex realtime interaction are especially encouraged to apply.

Responsibilities

  • Research and develop audio understanding models for tasks such as speech semantics, acoustic event understanding, dialogue state tracking, speaker behavior understanding, and emotion-aware interaction.
  • Design, train, and evaluate end-to-end audio LLM systems that take raw or lightly processed audio as input and produce robust conversational outputs.
  • Improve models for realtime deployment, including streaming inference, latency control, interruption handling, barge-in, and full-duplex conversational behavior.
  • Build evaluation protocols and error taxonomies for challenging conversational scenarios, then turn findings into concrete data and model improvements.
  • Work with product and engineering to translate model advances into user-facing conversational quality gains.

Requirements

  • Strong foundation in machine learning and deep learning, with hands-on experience training and analyzing large neural models.
  • Solid coding skills in Python and strong familiarity with PyTorch and modern research tooling.
  • Experience in one or more of the following areas: speech recognition, spoken language understanding, audio-language modeling, multimodal learning, dialogue systems, or realtime interaction systems.
  • Ability to design rigorous experiments, interpret failure modes clearly, and iterate quickly from data to model to evaluation.
  • Strong communication and collaboration skills, with the ability to work across research, product, and infrastructure teams.

Preferred Qualifications

  • Hands-on experience with end-to-end audio LLMs, speech-language models, or other architectures that directly model audio-to-text or audio-to-action behavior.
  • Experience building or researching full-duplex spoken agents, realtime voice assistants, or low-latency conversational systems.
  • Experience building audio datasets, running data pipelines, or designing annotation standards and quality-control workflows for speech or conversational data.
  • Public work in relevant areas, such as published papers, open-source releases, or technical reports, is strongly preferred.
  • Experience with multilingual or cross-domain audio understanding is a plus.

How to Apply

  • Send your resume together with selected papers, technical reports, demos, or repository links that best represent your work.
  • If you have worked on audio data pipelines or annotation systems, include a short note on the scale, quality bar, and your contribution.
  • For projects involving realtime systems, conversational agents, or audio LLMs, please include concrete latency and product metrics; they are highly valued.
  • Take-home tasks, if any, will be paid at a reasonable market rate.