TL;DR

Principal Platform Engineer (Distributed Data Platforms): Define and evolve the architectural vision and operating standards for large-scale distributed data platforms, leading design for scalability, resilience, and security. The role focuses on solving complex distributed systems problems, influencing engineering teams, and establishing reliability and observability standards in multi-cluster Kubernetes environments.

Company

Smarsh helps more than 6,500 organizations in regulated industries manage risk and intelligence in their digital communications using AI/ML technology. The company is recognized for consistent leadership and aggressive growth.

What you will do

  • Define and evolve architectural vision and operating standards for large-scale distributed data platforms.
  • Lead the design and evolution of highly available, scalable clusters for MongoDB, Elasticsearch, and Apache Kafka.
  • Solve complex, ambiguous distributed systems problems, balancing scalability, resilience, performance, security, and cost.
  • Influence engineering teams to adopt platform standards, automation practices, and self-service capabilities.
  • Partner with leadership to align platform capabilities with strategic business objectives.
  • Establish reliability targets, operational models, and observability standards for stateful workloads on multi-cluster Kubernetes.

Requirements

  • 8+ years of experience in platform engineering, SRE, or distributed systems-focused roles.
  • Demonstrated subject matter expertise in operating MongoDB, Kafka, or Elasticsearch at high scale, with deep day-2 operational knowledge.
  • Significant experience designing and operating large-scale Kubernetes environments and associated tooling (Helm, Kustomize, ArgoCD).
  • Proven experience defining architectural standards and influencing technical direction.
  • Strong programming skills (Python, Java, or similar) with experience building internal platform APIs and automation tooling.
  • Extensive experience with Infrastructure as Code (Terraform) and cloud-native deployment models, specifically with enterprise-scale workloads in AWS.
  • Experience designing and evolving observability platforms (Prometheus/Grafana, ELK) for multi-cluster environments.
  • Strong understanding of security principles and experience embedding best practices into production environments.

Culture & Benefits

  • Lifelong learners passionate about innovating with purpose, humility, and humor.
  • Emphasis on collaboration, working with popular communications platforms and cloud infrastructure providers.
  • Use of the latest AI/ML technology on behalf of customers.
  • Global organization valuing diversity and authenticity.
  • Recognized as a Best Place to Work by Comparably.com for leadership, culture, and commitment to development.