TL;DR

Lead Machine Learning Engineer (LLM Infrastructure): Build and maintain scalable infrastructure and engineering systems for LLM post-training, evaluation, and deployment, with an emphasis on distributed systems, training orchestration, and feedback-driven model improvement. Focus on designing reliable pipelines, optimizing distributed workloads, and operationalizing research methods into production-grade ML systems.

Location: San Francisco or Palo Alto, California, United States

Company

Salesforce is a leading global software corporation specializing in AI, SaaS, and cloud solutions.

What you will do

  • Design, build, and maintain infrastructure for LLM post-training, evaluation, and deployment.
  • Own scalable pipelines for training orchestration, rollout generation, reward and feedback processing, checkpointing, and experiment management.
  • Build reliable systems for feedback-driven model improvement, including human and AI feedback loops and large-scale offline evaluation.
  • Collaborate with research scientists, agent engineers, and platform teams to operationalize post-training methods and integrate with production stacks.
  • Optimize distributed training and inference workloads for reliability, throughput, cost efficiency, and observability.
  • Drive best practices for reproducibility, versioning, monitoring, deployment, and operational excellence across ML systems.

Requirements

  • Must be based in San Francisco or Palo Alto, California, United States.
  • 5+ years of experience in software engineering, ML systems, or distributed infrastructure.
  • Strong proficiency in Python and experience building production ML pipelines and infrastructure.
  • Experience with LLM post-training infrastructure including RLHF, reward modeling, and feedback-driven workflows.
  • Familiarity with cloud platforms (AWS, GCP) and containerized environments (Docker, Kubernetes).
  • Strong debugging skills and experience designing scalable distributed systems.

Nice to have

  • Experience with rollout systems, large-scale evaluation loops, or training data/feedback pipelines.
  • Familiarity with distributed training frameworks and modern ML infrastructure stacks.
  • Experience supporting agent-based learning, simulation environments, or iterative model improvement systems.
  • Prior experience working closely with AI research or incubation teams.

Culture & Benefits

  • Competitive compensation and strong long-term growth opportunities.
  • Work at the intersection of AI research and large-scale engineering systems.
  • Ownership of systems turning research models into production AI capabilities.
  • Inclusive hiring practices pursuant to San Francisco and Los Angeles Fair Chance ordinances.