Design, implement, and maintain scalable, highly available, and fault-tolerant systems to ensure the reliability of production environments.
Monitor system performance, identify potential issues, and respond to incidents promptly to minimize downtime and impact on users.
Develop and maintain automation tools for system deployment, monitoring, and troubleshooting to improve operational efficiency.
Collaborate with development and operations teams to ensure smooth system integration and continuous improvement of infrastructure.
Implement and manage continuous integration/continuous deployment (CI/CD) pipelines for automated application delivery and updates.
Develop and manage monitoring and alerting systems to track system health and performance metrics.
Conduct post-mortem analyses with root cause analysis (RCA) for incidents, and identify preventive measures that improve future reliability.
Define service level indicators (SLIs) and establish and track service level objectives (SLOs) to ensure application performance aligns with user expectations.
Manage capacity planning, scaling, and load balancing to ensure optimal system performance during high-demand periods.
Lead efforts in improving infrastructure resiliency through techniques like chaos engineering, failover testing, and disaster recovery planning.
Requirements
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field (Master’s degree preferred).
5-7 years of experience in site reliability engineering, DevOps, or related fields, with a strong background in system administration or software engineering.
Proficiency in scripting and automation using languages such as Python, Go, Bash, or Ruby.
Strong experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and containerization technologies (e.g., Docker, Kubernetes).
Expertise in monitoring tools (e.g., Prometheus, Grafana, Datadog) and logging frameworks (e.g., ELK stack, Splunk).
Familiarity with infrastructure-as-code (IaC) tools such as Terraform, Ansible, or CloudFormation.
Strong understanding of networking concepts, security best practices, and troubleshooting techniques for distributed systems.
Experience defining and managing SLOs and SLIs, and monitoring production environments.
Ability to perform root cause analysis and improve system reliability through proactive measures.
Excellent communication and collaboration skills, with the ability to work cross-functionally with developers, QA, and operations teams.