TL;DR
Lead Machine Learning Engineer (LLM Infrastructure): Build and maintain scalable infrastructure and engineering systems for LLM post-training, evaluation, and deployment, with an emphasis on distributed systems, training orchestration, and feedback-driven model improvement. Focus on designing reliable pipelines, optimizing distributed workloads, and operationalizing research methods into production-grade ML systems.
Location: San Francisco or Palo Alto, California, United States
Company
Salesforce is a leading global software corporation specializing in AI, SaaS, and cloud solutions.
What you will do
- Design, build, and maintain infrastructure for LLM post-training, evaluation, and deployment.
- Own scalable pipelines for training orchestration, rollout generation, reward and feedback processing, checkpointing, and experiment management.
- Build reliable systems for feedback-driven model improvement including human or AI feedback loops and large-scale offline evaluation.
- Collaborate with research scientists, agent engineers, and platform teams to operationalize post-training methods and integrate with production stacks.
- Optimize distributed training and inference workloads for reliability, throughput, cost efficiency, and observability.
- Drive best practices for reproducibility, versioning, monitoring, deployment, and operational excellence across ML systems.
Requirements
- Location: Must be based in San Francisco or Palo Alto, California, United States
- 5+ years of experience in software engineering, ML systems, or distributed infrastructure.
- Strong proficiency in Python and experience building production ML pipelines and infrastructure.
- Experience with LLM post-training infrastructure including RLHF, reward modeling, and feedback-driven workflows.
- Familiarity with cloud platforms (AWS, GCP) and containerized environments (Docker, Kubernetes).
- Strong debugging skills and experience designing scalable distributed systems.
Nice to have
- Experience with rollout systems, large-scale evaluation loops, or training data/feedback pipelines.
- Familiarity with distributed training frameworks and modern ML infrastructure stacks.
- Experience supporting agent-based learning, simulation environments, or iterative model improvement systems.
- Prior experience working closely with AI research or incubation teams.
Culture & Benefits
- Competitive compensation and strong long-term growth opportunities.
- Work at the intersection of AI research and large-scale engineering systems.
- Ownership of systems turning research models into production AI capabilities.
- Inclusive hiring practices pursuant to the San Francisco and Los Angeles Fair Chance ordinances.
