45 Site Reliability Engineer jobs in Thailand
Site Reliability Engineer
Posted today
Job Description
Thailand, Bangkok | Full Time | Technology
Site reliability engineers are responsible for improving the quality of software processes and services in production. They design code to automate processes to improve the efficiency of deliverables and act as a bridge between development and operations. SREs are responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
Working Location: Empire Tower (100% onsite, accessible via BTS Chong Nonsi)
Responsibilities:
- Monitor the health of your services and work with developers to increase the velocity of changes using built-in support for service monitoring.
- Select metrics for SLIs, set SLOs, and track error budgets to mitigate risk for the service.
- Use powerful dashboards to aggregate metrics and logs, including golden signals to reduce MTTR and quickly answer questions about service health.
- Take ownership of platform-related incident management and resolution, ensuring timely communication and effective problem-solving.
- Automate various provisioning and maintenance tasks using scripts and automation tools.
Qualifications:
- year of experience as a software engineer or systems administrator, with willingness to become an SRE in the future, for Junior level.
- Minimum 5 years of experience as an SRE for Senior level.
- Experience coding in at least one language (Bash, Python, PowerShell, etc.)
- Ability to use observability tools such as Datadog, Grafana, Elasticsearch, and Kibana
- Ability to use cloud services (AWS, etc.)
- Good command of English, both spoken and written
Nice to have:
- Knowledge of best practices and IT operations in always-available and highly scalable services
- Experience with automation CI/CD tools (GitHub Actions, Jenkins, Ansible, Terraform, etc.)
- Experience with containerization, container orchestration, and microservices: Docker, Kubernetes (K8s), Helm
- Knowledge of IT service management (ITSM): incident management, problem management, change management
We offer an attractive remuneration package, a fast-paced and exciting working environment, and provide challenging opportunities for life-long learning and career development.
Interested candidates are invited to send a comprehensive resume with current and expected salary package via this job ad. Please note that only shortlisted candidates will be notified.
Please consult our Candidate Privacy Notice to learn more about how we collect, use, transfer, and disclose our candidates' information.
By submitting your resume and information, you understand, acknowledge, and consent that your personal data will be processed in accordance with our Candidate Privacy Notice. You consent to the collection, use, transfer and disclosure of your personal data as well as to receive email and/or other electronic messaging communication from 2C2P.
Site Reliability Engineer
Posted today
Job Description
Job Summary:
SRE Engineers are typically responsible for the availability and reliability of critical AWS cloud-based platform services and applications, ensuring they meet their SLI, SLO, and SLA requirements. SRE Engineers also take part in on-call duties to resolve cases escalated from support incidents. SRE Engineers collaborate with cross-functional teams to build and run sustainable product systems.
Job Responsibilities:
- Be on a PagerDuty rotation to respond to availability incidents and provide support for service engineers with customer incidents.
- Debug production issues across services.
- Propose ideas and solutions within the infrastructure team to reduce workload through automation.
- Measure and optimize system performance, create dashboards, perform capacity planning, and innovate to improve continually.
- Improve reliability, quality, and time-to-market of our suite of software solutions.
Qualifications:
- Bachelor's degree in computer science/engineering or other highly technical field.
- Ability to work under pressure.
- 1-3 years with AWS Cloud services: EC2, EKS, RDS, AWS Batch, runbook scripts.
- 1-3 years with DevOps tools, e.g. Jira, GitLab, Confluence, Terraform.
- 1-3 years with monitoring and dashboard tools, e.g. Prometheus, Grafana, ELK.
- Good knowledge of Python or RPA is preferable.
The successful candidate will join a fully agile, cross-functional development team to deliver the brand-new SCB banking channel. He/She will be experienced with modern processes and technologies in the market such as Continuous Delivery, Docker, and Amazon Web Services, as well as many new testing techniques and tools. The candidate will also be challenged to work with many parties, such as vendors and the SCB IT Department.
Only shortlisted candidates will be contacted.
Site Reliability Engineer
Posted today
Job Description
Urgently Required: "Site Reliability Engineer / System Admin" (Tong Hua Group)
Responsibilities:
1. Maintain, monitor, and troubleshoot the company's cloud, blockchain, AI and associated business systems across on-premise and multi-cloud environments.
2. Deploy and manage applications on Linux platforms and virtualized infrastructure (Proxmox, VMware, OpenShift), handling system installations, configurations, and ongoing maintenance tasks.
3. Develop, implement, and manage CI/CD pipelines using tools such as GitHub Actions, Ansible, and Kubernetes to ensure seamless and efficient deployment workflows.
4. Design high-availability systems with load balancing (HAProxy, Nginx), caching (Redis), and failover configurations.
5. Conduct daily monitoring, data backup, and recovery using open-source monitoring tools (Prometheus, Grafana, Loki) for performance reporting, issue tracking, and proactive health checks.
6. Perform anomaly detection, root cause analysis, and automated alerting to address and prevent system failures and performance bottlenecks.
7. Automate operational tasks and improve system resilience through scripting (Bash, Python, or Golang) and configuration management tools.
8. Maintain and optimize infrastructure components such as Docker, Kubernetes, databases (PostgreSQL, MySQL), and distributed storage (Ceph, MinIO).
9. Set up VPN, VPC, and secure networking for client environments with proper isolation and security.
10. Collaborate with cross-functional teams to support infrastructure improvements, incident response, and operational resilience.
Requirements:
1. Bachelor's degree in Computer Science, Information Technology, or a related field, with 4+ years of relevant experience in DevOps, SRE, or similar roles.
2. Demonstrated experience with production-grade infrastructure in high-availability (e.g. load balancing) and high-performance environments (e.g. cache optimization). Proficiency in Linux administration and containerization (Docker, Kubernetes).
3. Strong knowledge of CI/CD processes and automation tools (Ansible, Terraform) and experience scripting (Python, Shell) for operational automation.
4. Solid understanding of networking protocols (TCP/IP, DNS, DHCP) and networking expertise (VPN, VPC, firewalls).
5. Hands-on experience with on-premise virtualization (VMware, Proxmox, OpenShift or similar) and cloud platforms.
6. Proficient in monitoring and logging solutions (Prometheus, Grafana, Loki) for proactive system management.
7. Familiarity with database management and distributed storage solutions, particularly PostgreSQL, YugabyteDB, Qdrant and MinIO.
8. Multi-cloud and hybrid environment experience.
9. Ability to communicate in English at a conversational level.
Tong Hua Group · MRT Hua Lamphong
If you are interested, please send your updated CV with current and expected salary to my email:
Tel:
Site Reliability Engineer
Posted today
Job Description
About the Opportunity
Cathcart Technology is working with a leading international organisation on a large-scale Cloud transformation programme. As part of a pioneering project, they are migrating from traditional on-premise systems to Cloud.
As an SRE, you'll work closely with global teams while being the key presence locally — ensuring reliability, automation, and observability across critical projects.
Responsibilities
- Support cloud migration projects, moving systems from on-premise to private and public Cloud (AWS)
- Build and maintain monitoring solutions, automating detections and responses.
- Define and implement SLI and SLO metrics to ensure service availability.
- Deploy new application releases into pre-production and production environments.
- Drive automation in deployments, system reconfiguration, and monitoring improvements.
- Collaborate with development, DevOps, and testing teams on continuous delivery and quality assurance.
- Document incidents, solutions, and best practices while sharing knowledge with the wider SRE community.
What We're Looking For
- Bachelor's degree in Information Technology, Computer Engineering, or related field
- 5+ years' experience in DevOps, Cloud, System Engineering, or a related field
- Hands-on experience with Kubernetes/OpenShift.
- Experience with public Cloud (AWS or Azure).
- Solid knowledge of Linux, VMs, and shell scripting
- Familiarity with CI/CD tools such as Jenkins.
- Experience with monitoring/logging tools (Nagios, Splunk, or similar).
- Good communication skills in English
This is an exciting opportunity to be part of a pioneering Cloud migration project, working on high-impact systems with international collaboration. If you're passionate about reliability, automation, and scalable infrastructure, this role is for you.
For more details, please contact Cathcart Technology
Site Reliability Engineer
Posted today
Job Description
Job Summary:
SRE Engineers are typically responsible for the availability and reliability of critical AWS cloud-based platform services and applications, ensuring they meet their SLI, SLO, and SLA requirements. SRE Engineers also take part in on-call duties to resolve cases escalated from support incidents. SRE Engineers collaborate with cross-functional teams to build and run sustainable product systems.
Job Responsibilities:
- Be on a PagerDuty rotation to respond to availability incidents and provide support for service engineers with customer incidents.
- Debug production issues across services.
- Propose ideas and solutions within the infrastructure team to reduce workload through automation.
- Measure and optimize system performance, create dashboards, perform capacity planning, and innovate to improve continually.
- Improve reliability, quality, and time-to-market of our suite of software solutions.
Qualifications:
- Bachelor's degree in computer science/engineering or other highly technical field.
- Ability to work under pressure; new graduates are welcome to apply.
- 1-3 years with AWS Cloud services: EC2, EKS, RDS, AWS Batch, runbook scripts.
- 1-3 years with DevOps tools, e.g. Jira, GitLab, Confluence, Terraform.
- 1-3 years with monitoring and dashboard tools, e.g. Prometheus, Grafana, ELK.
- Good knowledge of Python or RPA is preferable.
The successful candidate will be part of a fully agile, cross-functional development team to deliver the brand-new SCB banking channel. He/She will be experienced with modern processes and technologies in the market such as Continuous Delivery, Docker, and Amazon Web Services, as well as many new testing techniques and tools. The candidate will also be challenged to work with many parties, such as vendors and the SCB IT Department.
Our Benefits :
- Bonus
- Birthday Leave
- Mobile Allowance / Internet Allowance
- Life / Accident Insurance
- SCB Tele Care
- Flexible Benefit
- Housing Loan
- Provident Fund
- Cooperative Fund
- Near BTS Phaholyothin 24 and BTS Ratchayothin
- Shuttle Bus to MRT / BTS
- Car Parking
- Sport club / Fitness / Sport activity
- Co-working space
SCB Tech X Co., LTD
Human Resources Division
18 SCB Park Plaza, Tower West A, 2nd Floor, Ratchadapisek Rd.,
Chatujak, Bangkok 10900 Thailand
Site Reliability Engineer
Posted today
Job Description
We are looking for a Site Reliability Engineer (SRE) to own the reliability, security, and performance of our healthcare-critical application platform. This role combines hands-on infrastructure operations, automation, and vendor coordination for our on-premise hospital sites. You will be part of the team that ensures our systems run smoothly, securely, and with minimal downtime — whether in the cloud or at customer premises.
Key Responsibilities
Incident & Reliability Management
- Participate in an on-call rotation to respond to incidents impacting system availability.
- Support software engineers during customer incidents, providing expertise in root cause analysis and rapid mitigation.
- Debug production issues across services, infrastructure, and network layers.
- Build monitoring and alerting that triggers on symptoms, not just outages, to catch issues early.
Infrastructure & Automation
- Run and maintain infrastructure with Pulumi, GitHub Actions, ArgoCD, and Kubernetes.
- Implement automation for deployments, upgrades, and routine maintenance tasks.
- Document operational actions to create repeatable processes and automated solutions.
- Improve operational processes to enhance system uptime and reduce human intervention.
Security & Compliance
- Manage cloud and on-premise environments in accordance with company security guidelines and healthcare regulations (HIPAA, GDPR).
- Implement security automation from pre-commit to production stages.
- Collaborate with security teams to ensure patching, configuration management, and access controls are in place.
Vendor & On-Premise Site Management
- Coordinate with on-premise site vendors to ensure system reliability, upgrades, and maintenance are performed according to SLAs.
- Define technical requirements, monitoring standards, and incident escalation procedures for vendors.
- Review vendor performance, provide feedback, and ensure alignment with company operational standards.
- Support new site deployments, including infrastructure validation, vendor onboarding, and handover to operations.
Collaboration & Enablement
- Educate internal teams on new infrastructure tools, cloud capabilities, and operational best practices.
- Work closely with product, engineering, and security teams to design resilient and scalable architectures.
- Actively engage in capacity planning, performance tuning, and long-term infrastructure strategy.
Technology Stack Skills
- Languages: GoLang, Python, Shell Script
- Cloud: Azure Cloud
- CI/CD: GitHub Actions, ArgoCD, Pulumi, Terraform
- Kubernetes Ecosystem: Kubernetes, Kustomize, Helm
- Monitoring & Observability: Prometheus, Grafana
- Infrastructure: Linux/UNIX, Docker
- Networking: TCP/IP, DNS, HTTP, SMTP, distributed networks
- Databases: SQL and NoSQL (e.g., SQL Server, PostgreSQL, OpenSearch)
Qualifications
Required:
- Bachelor's degree in Computer Science, Engineering, Information Technology, or equivalent experience.
- 2+ years of software development experience in Go, Python, or Java.
- 2+ years in a Cloud Engineer or SRE role, with hands-on experience in Linux/UNIX, Docker, and Microsoft Azure.
- Strong understanding of microservices architecture and distributed systems.
- Experience managing production Kubernetes clusters and cloud infrastructure.
- Knowledge of monitoring, alerting, and incident management practices.
Preferred:
- Experience in vendor management for IT infrastructure or on-premise deployments.
- Healthcare IT experience with regulated environments.
- Familiarity with HL7, FHIR, DICOM, and healthcare integration patterns.
- Prior work with hybrid (cloud + on-premise) environments.
Site Reliability Engineer
Posted today
Job Description
Skillset & experience required
• Expertise in High-Volume Transaction Systems:
- Proven experience in managing and optimizing high-volume transaction systems, preferably in banking or financial services.
• Strong Technical Background:
- Solid understanding of network, server, and application-level troubleshooting, with hands-on experience in using monitoring and observability tools (e.g., New Relic, Prometheus, ELK stack, Grafana).
• Proficiency in Programming and Scripting:
- Skills in programming and scripting languages (e.g., Python, Bash) to automate tasks and integrate systems.
• Experience with Cloud and Container Technologies:
- Knowledge of cloud service platforms (e.g., AWS, Azure, GCP) and container orchestration tools (e.g., Kubernetes, Docker) to deploy and manage services.
• Understanding of CI/CD Tools and Practices:
- Experience with CI/CD tools (e.g., Jenkins, Azure DevOps) and practices to facilitate rapid and safe deployments.
• Familiarity with Security Standards:
- Understanding of security best practices and compliance standards relevant to transaction processing and financial data.
• Analytical Skills and Problem-Solving:
- Strong analytical skills with the ability to solve complex problems under pressure.
Responsibilities
• Service Scalability and Optimization:
- Work closely with the cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) of the system.
- Work closely with the development team to ensure the transaction services platform is scalable, identifying and addressing any scalability or performance limits.
- Work closely with the cross-functional teams to perform capacity planning and resource allocation to ensure optimal system performance and scalability.
- Optimize the performance of the transaction services to handle peak loads efficiently.
• Transaction services Performance and Reliability Monitoring:
- Work collaboratively to develop and maintain monitoring tools, alerts, dashboards and processes to provide visibility into health, performance and reliability of transaction services, ensuring they meet SLAs.
- Set up a monitoring system to measure key reliability metrics (e.g. MTTF, MTTR, MTBF, MTTD).
- Analyze transaction patterns and identify potential bottlenecks or failure points in the platform.
• Incident Response and Troubleshooting:
- Work collaboratively with the Bank Operations support team as the first responder for any issues within the transaction services platform, employing a systematic troubleshooting approach to resolve issues quickly.
- Develop and refine incident response protocols to minimize downtime and transaction failures.
- Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents.
• Continuous Integration/Continuous Deployment (CI/CD) for Transaction Services:
- Implement and maintain CI/CD pipelines for transaction services, ensuring smooth and reliable deployments with minimal impact on live environments.
- Automate service deployment and rollback procedures to enhance operational efficiency.
- Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
• Security and Compliance Assurance:
- Ensure that all aspects of the transaction services platform adhere to industry security standards and compliance requirements, particularly those related to financial transactions.
- Work with the security team to implement and maintain security measures, such as encryption and access controls, to protect transaction data.
Working Location: Saengthong Thani Tower (Near BTS Chong Nonsi), Bangkok
If you require more information, please contact K. Pongpon Suksai (พงศ์พล) Tel
Site Reliability Engineer
Posted today
Job Description
Responsibilities
- Maintain and scale LINE products and services to hundreds of millions of users from Thailand and around the world
- Monitor and maintain health and availability of our production services in order to prevent outages and issues
- Manage our continuous integration and continuous delivery platform all the way from development to production
- Automate various provisioning and maintenance tasks using scripts and automation tools
- Participate in our cross-functional product development teams and handle DevOps tasks in products
- Help improve overall team productivity relating to development, testing and deployment
Qualifications
- Bachelor's degree in any field
- Strong background in Linux/Unix administration
- Ability to use a wide variety of open-source technologies and cloud services (AWS, Google Cloud, OpenStack, etc.)
- Strong grasp of automation CI/CD tools (ArgoCD, GitHub Actions, Jenkins, Ansible, Terraform, etc.)
- Working understanding of scripting languages (Shell script, Python, Lua, etc.)
- Knowledge and experience in monitoring and troubleshooting tools (ELK, Grafana, Prometheus, Sentry, OpenTelemetry, etc.)
- Experience working with Docker in production and container orchestration (Kubernetes, Rancher)
- Knowledge of several database technologies (MySQL, Postgres, Redis, MongoDB, etc.)
- Knowledge of message queue technologies (Kafka, RabbitMQ, etc.)
- Knowledge of best practices and IT operations in always-available and highly-scalable services
- Knowledge of programming languages (Golang, etc.) is a plus.
Location
LINE Thailand Head Office, Gaysorn Tower, Bangkok
Senior Site Reliability Engineer
Posted today
Job Description
We are seeking a highly motivated and experienced Senior Site Reliability Engineer (SRE) to join our growing team. As a Senior SRE, you will play a critical role in ensuring the reliability, scalability, and performance of our production systems. You will leverage your deep understanding of infrastructure, automation, and observability to champion operational excellence and build a resilient platform.
Key Responsibilities:
- Manage and operate our Kubernetes platform, ensuring high availability, performance, and security.
- Design, develop, and implement automation solutions for operational tasks, infrastructure provisioning, and application deployment.
- Build and maintain a comprehensive observability stack (monitoring, logging, tracing) to proactively identify and resolve issues.
- Implement and maintain proactive measures to ensure platform stability, performance optimization, and capacity planning.
- Provide support and expertise for critical middleware tools such as RabbitMQ, Redis, and Kafka, ensuring their optimal performance and reliability.
- Participate in our on-call rotation, troubleshoot and resolve production incidents efficiently, and implement preventative measures.
- Collaborate effectively with development and other engineering teams.
Qualifications:
- Positive attitude and empathy for others.
- Passion for developing and maintaining reliable, scalable infrastructure.
- A minimum of 3 years of working experience in relevant areas.
- Experience in managing and operating Kubernetes in a production environment.
- Experienced with cloud platforms like AWS or GCP.
- Experienced with high availability, high-scale, and performance systems.
- Understanding of cloud-native architectures.
- Experienced with DevSecOps practices.
- Strong scripting and automation skills using languages like Python, Bash, or Go.
- Proven experience in building and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI).
- Deep understanding of monitoring, logging, and tracing tools and techniques.
- Experience with infrastructure-as-code tools (e.g., Terraform, Ansible).
- Strong understanding of Linux systems administration and networking concepts.
- Experience working with middleware technologies like RabbitMQ, Redis, and Kafka.
- Excellent problem-solving and troubleshooting skills.
- Excellent communication and collaboration skills.
- Strong interest and ability to learn any new technical topic.