Gamania Digital Entertainment

Site Reliability Engineer

21 days ago

<Responsibilities>

Maintain and optimize infrastructure.
Monitor services and resolve any problems that occur as quickly as possible.
Monitor resource usage indicators and overall system status, and optimize accordingly.
Maintain system stability and deal with emergencies.
Avoid system failures and service interruptions.
Work with other teams to continuously improve system architecture and service quality.
Design and build a complete process for system maintenance, deployment, and system upgrade.

<Required Skills>

Experience implementing monitoring services and log collection systems
Familiar with Linux
Understands containers and Kubernetes management and scheduling
Experience implementing RDBMS and NoSQL cluster services
Experience managing infrastructure access control and information security
Familiar with IaC tools, such as Terraform
Familiar with container technologies, such as Docker, containerd, and Podman
Familiar with common monitoring systems for Kubernetes, such as Prometheus

<Preferred Skills>

Experience deploying at least one of the following cloud services: Azure, GCP, AWS
Understands DevOps and its concepts
Familiar with basic network architecture and protocols, such as HTTP/HTTPS, TCP/IP, DNS, and CDN
Familiar with setting up and tuning the configuration of web servers, such as NGINX and Apache
Experience with CI tools (e.g., GitLab CI, Jenkins, GitHub Actions) for deploying, setting up, and maintaining services
Familiar with integrations and operations between different monitoring systems (Zabbix, Cacti, Nagios, Smokeping, etc.), as well as all types of logs and system-generated data (the ELK stack of Elasticsearch, Logstash, and Kibana; Splunk; New Relic; Prometheus)
Familiar with distributed tracing platforms (e.g., OpenTelemetry, Jaeger, Tempo) and network architecture design