SRE (Site Reliability Engineer)
Responsibilities:
Participate actively in the business team's daily development and operations work, and collaborate closely with the OPS team to support the business team's day-to-day operational needs.
Design and implement emergency response strategies for business system anomalies so that service anomalies are handled quickly and effectively, and organize regular drills to verify that these strategies work.
Identify performance bottlenecks and technical risks in business systems, and design and implement optimization plans to ensure the scalability, flexibility, and high performance of the business architecture.
Own stress testing and capacity management for business systems, and optimize resource allocation based on test results to ensure system stability and performance under high load.
Participate in development work that improves the stability and reliability of business systems.
Build observability capabilities for business systems to ensure that system anomalies are detected and handled promptly.
Participate in on-call duties (on-call hours: 21:00 - 06:00 UTC+8).
Requirements:
Bachelor's degree or higher in Computer Science or a related field.
More than 5 years of relevant work experience; Golang development experience is preferred, and experience developing trading or market systems is a plus.
Proficient with cloud computing platforms and containerization technologies, such as AWS, Aliyun, and Kubernetes, backed by substantial hands-on experience.
Familiar with the Linux operating system and common command-line tools, with strong troubleshooting skills.
Excellent teamwork and communication skills, able to collaborate with different technical teams.
Highly responsible and capable of taking on important tasks and challenges for the team.