Alibaba Cloud

Senior Network Engineer

Sunnyvale, CA, US

$219.6k/year
about 1 month ago

Summary

Job Description

1. Observability Link Construction for Operations and Maintenance

a. Maintain a global perspective on stability, and develop and implement stability solutions.

b. Pre-event: Establish and continually optimize monitoring mechanisms for application operations and maintenance; develop and maintain corresponding monitoring platforms/tools.

c. During the event: Establish and continuously optimize alerting mechanisms for application operations and maintenance, ensuring that faults can be quickly discovered, located, and addressed.

d. Post-event: Quickly analyze, diagnose, and locate problems, and collaborate with relevant personnel to resolve them. Establish and improve rapid service recovery mechanisms to reduce business impact, and ensure stable business operations by identifying and eliminating potential risks through stability governance projects and architectural optimizations.

2. Stability Operations and Maintenance Platform Construction

a. Design, develop, and maintain reliable operations and maintenance platforms and tools, such as inspection systems, water level systems, delivery systems, cost management systems, etc., to address issues related to delivery, performance, stability, and cost encountered by production systems, ensuring business availability and enhancing performance and efficiency.

b. Responsible for data-driven analysis of operations and maintenance quality; analyze and study daily operations and maintenance metrics, issues, and risks to establish models and provide optimization suggestions for operations and maintenance.

3. Application Operations and Maintenance Standard Construction

a. Establish standardized operations and maintenance process specifications (such as change standards, protection plans, and cloud product configuration standards) to keep operations and maintenance consistent and thereby enhance stability.

b. Develop and implement emergency response specifications and standards for application operations and maintenance faults.

c. Develop and implement alarm handling specifications and standards for application operations and maintenance, as well as Service Level Agreements (SLA).

4. Resource Optimization

a. Based on business requirements, handle budget preparation, capacity planning, and resource readiness, and coordinate with development teams to forecast and estimate consumption of resources such as storage and computing.

b. Analyze business demands, ensuring stability while integrating water levels, specifications, and billing rules; control the reasonableness of resource estimation in technical solutions and collaborate with development to reduce resource costs.

5. Security Assurance Construction

a. Provide 24/7 emergency response, handle daily monitoring alerts and emergencies, and continuously identify and rectify existing issues.

b. Responsible for operations and maintenance support during major events (such as National Day, Spring Festival, New Year's Day, and significant activities).

c. Develop and drill emergency plans, respond to emergencies, and handle faults.

d. Establish a problem/fault record repository, conduct targeted analysis of the repository, and enhance and optimize the emergency plan repository and standard process repository.

6. Architecture Upgrade

a. Responsible for system architecture upgrades, such as kernel upgrades, architecture upgrades, cross-data-center service migration, and containerization.

b. Responsible for the design and implementation of disaster recovery architecture, such as local disaster recovery and multi-active geographically distributed setups.


Job Requirements

1. Fluent in Chinese, able to clearly articulate technical issues and solutions.

2. Over 3 years of experience in operations and maintenance in related fields such as applications, networks, and containerization.

3. Working knowledge of architecture design, performance optimization, and stability optimization.

4. Capable of applying intelligent and automated operations and maintenance platforms and tools, designing and utilizing complex workflows and daily operational templates, quickly identifying, locating, and resolving relatively complex faults, thereby improving operational efficiency.

5. Able to summarize and consolidate issues discovered in daily operations and maintenance into operational experience, and apply this knowledge to enhance capabilities within the operations and maintenance platform.

6. Proficient in protocols such as TCP/IP, DNS, and HTTP, with the ability to perform preliminary analysis of network traffic and troubleshoot network issues.

7. Familiar with at least one cloud service platform (such as AWS, Alibaba Cloud, Azure, etc.) and its related mainstream products (such as Flink, MaxCompute, Log Service, RDS, Redis, etc.), able to preliminarily troubleshoot and resolve basic issues related to the use of corresponding cloud products.

8. Bonus Points: Familiarity with DPDK (Data Plane Development Kit) and experience in enhancing network processing performance.

9. Bonus Points: Some software development ability to advance operations and maintenance automation.

10. Bonus Points: Strong business understanding, capable of independently handling complex issues with real case examples.

11. Bonus Points: Sound independent judgment on business issues, able to skillfully use processes and tools to identify risks and formulate solutions.

12. Bonus Points: Influence within the business line and recognition from surrounding teams.




The pay range for this position at commencement of employment is expected to be between $133,200/year and $219,600/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.


If hired, employee will be in an “at-will position” and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.
