ASUS

RD21795 Site Reliability Engineer (AI Applications)

Taipei City, TW

21 days ago

Summary

Job Description

  • Design, implement, and maintain scalable, highly available, and fault-tolerant systems to ensure the reliability of production environments.
  • Monitor system performance, identify potential issues, and respond to incidents promptly to minimize downtime and impact on users.
  • Develop and maintain automation tools for system deployment, monitoring, and troubleshooting to improve operational efficiency.
  • Collaborate with development and operations teams to ensure smooth system integration and continuous improvement of infrastructure.
  • Implement and manage continuous integration/continuous deployment (CI/CD) pipelines for automated application delivery and updates.
  • Develop and manage monitoring and alerting systems to track system health and performance metrics.
  • Conduct post-mortem analyses and provide root cause analysis (RCA) for incidents, identifying preventative measures for future reliability improvements.
  • Establish and enforce service level objectives (SLOs) and service level indicators (SLIs) to ensure application performance aligns with user expectations.
  • Manage capacity planning, scaling, and load balancing to ensure optimal system performance during high-demand periods.
  • Lead efforts in improving infrastructure resiliency through techniques like chaos engineering, failover testing, and disaster recovery planning.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field (Master’s degree preferred).
  • 5-7 years of experience in site reliability engineering, DevOps, or related fields, with a strong background in system administration or software engineering.
  • Proficiency in scripting and automation using languages such as Python, Go, Bash, or Ruby.
  • Strong experience with cloud platforms (e.g., AWS, Azure, Google Cloud) and containerization technologies (e.g., Docker, Kubernetes).
  • Expertise in monitoring tools (e.g., Prometheus, Grafana, Datadog) and logging frameworks (e.g., ELK stack, Splunk).
  • Familiarity with infrastructure-as-code (IaC) tools such as Terraform, Ansible, or CloudFormation.
  • Strong understanding of networking concepts, security best practices, and troubleshooting techniques for distributed systems.
  • Experience in setting and managing SLOs, SLIs, and monitoring for production environments.
  • Ability to perform root cause analysis and improve system reliability through proactive measures.
  • Excellent communication and collaboration skills, with the ability to work cross-functionally with developers, QA, and operations teams.
