TL;DR
Site Reliability Engineer (System Engineering): Improve the reliability and performance of a large-scale cloud platform, with an emphasis on high availability, incident management, and automation. The role focuses on designing monitoring solutions, managing SLIs/SLOs, and building self-healing systems to ensure stable operations and fast incident resolution.
Location: Remote, with presence in multiple countries across the USA, Europe, and Latin America
Salary: €2350–€3700 per month (gross)
Company
Andersen cooperates with global leaders in FinTech, Healthcare, Retail, Telecom, and other industries, providing software development and IT services.
What you will do
- Maintain high system availability and reliability across cloud environments.
- Design and implement monitoring, logging, and alerting solutions.
- Define and manage SLIs, SLOs, and error budgets.
- Lead incident response, root cause analysis, and post-mortems.
- Automate operational tasks and improve CI/CD pipelines.
- Partner with engineering teams to enhance application resilience, and participate in capacity planning and disaster recovery.
Requirements
- 5+ years of experience as a Site Reliability Engineer or Incident Management Engineer.
- Strong incident escalation skills and experience with Azure cloud and Kubernetes administration.
- Experience with containerization (Docker), Infrastructure as Code (Terraform), monitoring/APM tools, log aggregation, and distributed tracing.
- English level: Upper-Intermediate (B2) or higher.
- Spanish or Portuguese level: Intermediate+ or higher.
Culture & Benefits
- Work with industry leaders across multiple sectors, with opportunities for professional growth.
- Mentoring and adaptation systems for new employees.
- Additional earning opportunities up to $1000/month.
- Access to corporate training portal and certification compensation.
- Private health insurance, sports activity compensation, English courses, and vibrant corporate life.
