TL;DR
Lead Machine Learning Engineer (LLM Infrastructure): Build and maintain scalable infrastructure and engineering systems for LLM post-training, evaluation, and deployment, with an emphasis on distributed systems, training orchestration, and feedback-driven model improvement. Focus on designing reliable pipelines, optimizing distributed workloads, and operationalizing research methods into production-grade ML systems.
Location: San Francisco or Palo Alto, California, United States
Company
Salesforce is a leading global software corporation specializing in AI, SaaS, and cloud solutions.
What you will do
- Design, build, and maintain infrastructure for LLM post-training, evaluation, and deployment.
- Own scalable pipelines for training orchestration, rollout generation, reward and feedback processing, checkpointing, and experiment management.
- Build reliable systems for feedback-driven model improvement including human or AI feedback loops and large-scale offline evaluation.
- Collaborate with research scientists, agent engineers, and platform teams to operationalize post-training methods and integrate with production stacks.
- Optimize distributed training and inference workloads for reliability, throughput, cost efficiency, and observability.
- Drive best practices for reproducibility, versioning, monitoring, deployment, and operational excellence across ML systems.
Requirements
- Location: Must be based in San Francisco or Palo Alto, California, United States
- 5+ years of experience in software engineering, ML systems, or distributed infrastructure.
- Strong proficiency in Python and experience building production ML pipelines and infrastructure.
- Experience with LLM post-training infrastructure including RLHF, reward modeling, and feedback-driven workflows.
- Familiarity with cloud platforms (AWS, GCP) and containerized environments (Docker, Kubernetes).
- Strong debugging skills and experience designing scalable distributed systems.
Nice to have
- Experience with rollout systems, large-scale evaluation loops, or training data/feedback pipelines.
- Familiarity with distributed training frameworks and modern ML infrastructure stacks.
- Experience supporting agent-based learning, simulation environments, or iterative model improvement systems.
- Prior experience working closely with AI research or incubation teams.
Culture & Benefits
- Competitive compensation and strong long-term growth opportunities.
- Work at the intersection of AI research and large-scale engineering systems.
- Ownership of systems turning research models into production AI capabilities.
- Inclusive hiring practices pursuant to the San Francisco and Los Angeles Fair Chance ordinances.
