Job Description
About the Role
As an L3 Site Reliability Engineer, you will serve as the highest technical escalation point for complex production issues while driving reliability engineering initiatives across cloud and infrastructure platforms.
This role combines Site Reliability Engineering (SRE) practices with advanced Level 3 support responsibilities, focusing on system availability, automation, observability, incident management, and continuous service improvement.
Key Responsibilities
- Define, monitor, and improve service level indicators (SLIs), service level objectives (SLOs), and other service reliability metrics
- Act as the L3 escalation point for critical incidents and complex production issues
- Perform root cause analysis (RCA) and implement permanent fixes
- Lead major incident response and service restoration activities
- Plan and execute upgrades, patching, and infrastructure changes following ITIL processes
- Participate in Change Advisory Board (CAB) reviews and change management activities
- Build...