Site Reliability Engineer
Company and Job Overview
A world-class consulting services company is seeking a talented Site Reliability Engineer to support its expansion. As a Site Reliability Engineer (SRE), you will play a key role in maintaining the reliability and performance of critical services. This role emphasizes strong system architecture and design principles, focusing on key SRE practices such as Service Level Objectives (SLOs), Service Level Indicators (SLIs), and the reduction of operational toil.
Job Description:
Design and implement resilient system architectures, automation tools, and scripts that support high availability, enhance operational efficiency, and reduce manual effort.
Define, track, and analyze SLOs and SLIs to ensure reliability and performance meet business needs.
Conduct thorough post-mortem analyses following incidents, driving continuous improvement through root cause identification and solution implementation.
Collaborate with development and operations teams to establish best practices in system reliability and incident management.
Troubleshoot and resolve issues related to database performance, network connectivity, and deployment failures, including diagnosing problems at the underlying platform level (e.g., Kubernetes, virtual machines).
Ensure that issues are resolved within the stipulated Service Level Agreements (SLAs), maintaining high standards of service delivery.
Identify and troubleshoot performance bottlenecks across systems, providing actionable recommendations for enhancements.
Maintain detailed documentation of processes and incident responses to support knowledge sharing and compliance.
Actively develop effective communication and relationship-building skills with stakeholders, clients, and team members.
Job Requirements:
Proficiency in a programming language such as Python, Golang, or Java, applied to improving operational efficiency.
Demonstrated experience in system architecture and design, prioritizing reliability and scalability.
Strong understanding of SRE principles, including SLOs, SLIs, toil reduction, and incident post-mortems.
Experience with cloud environments (e.g., AWS, Azure, Google Cloud) and their operational management.
Familiarity with DevOps practices and frameworks, including CI/CD, infrastructure as code, and containerization.
Strong expertise in Linux system administration.
Familiarity with networking concepts and effective troubleshooting techniques.
Familiarity with monitoring tools and performance optimization techniques.
Experience in scripting or automation for system administration tasks.
Apply online or feel free to contact me directly for more information about this opportunity. Due to the high volume of applicants, we regret to inform you that only shortlisted candidates will be notified. Thank you for your understanding.