TL;DR

Senior Staff Software Engineer: Lead the reliability and resilience strategy for a global digital bank, with an emphasis on defining technical direction and driving resilience engineering. The role focuses on running Chaos Engineering experiments and Disaster Recovery simulations, and on implementing robust SLOs and SLIs.

Location: Hybrid in Toronto, Canada

Company

Nubank is a digital banking platform and technology company redefining people's relationships with money across Latin America.

What you will do

  • Define the long-term roadmap for reliability and resilience to align with global expansion and regulatory requirements.
  • Execute Chaos Engineering experiments and Disaster Recovery simulations to uncover and mitigate systemic vulnerabilities.
  • Implement robust SLOs and SLIs across the organization to help product teams balance innovation speed with system stability.
  • Provide product squads with the training and architectural patterns they need to achieve operational excellence independently.

Requirements

  • Expertise in architecting and maintaining high-availability systems in public cloud environments, preferably AWS.
  • Deep experience in advanced root cause analysis and creating feedback loops that prevent incident recurrence.
  • Hands-on experience defining and implementing SLOs, SLIs, and error budgets in distributed microservices architectures.
  • Real-world experience implementing Chaos Engineering and Disaster Recovery planning in production-scale environments.

Nice to have

  • Experience setting technical direction and coordinating large-scale projects across multiple teams.
  • Background in Site Reliability Engineering (SRE).
  • Familiarity with multi-region or multi-cell architecture patterns.

Culture & Benefits

  • Health Insurance, Life Insurance, and Pension Plan.
  • Extended maternity and paternity leave.
  • Access to a learning platform with courses and a language learning program.
  • Mental health and wellness assistance program.
  • Vacations.