Role Summary:
As a Site Reliability Engineer (SRE), you’ll play a crucial role in keeping our platform secure, reliable, scalable, and performant. You will work closely with engineering, product, and security teams to build resilient infrastructure, automate operations, improve observability, and both prevent incidents and respond to them quickly.
Mandatory Skill Sets/Expertise:
- 4+ years of experience in SRE/DevOps or cloud infrastructure roles
- Strong expertise in AWS (EC2, RDS, S3, VPC, IAM, ECS/EKS, CloudWatch, etc.)
- Experience with IaC tools like Terraform, AWS CDK, or CloudFormation
- Hands-on experience with CI/CD tools (GitHub Actions, GitLab CI, CircleCI, or similar)
- Proficient in at least one scripting language (Python, Bash, etc.)
- Strong understanding of Linux-based systems, networking, DNS, and load balancing
- Experience with monitoring/logging tools (Prometheus, Grafana, ELK, Datadog, etc.)
- Experience with containerization and orchestration (Docker, ECS, EKS, or Kubernetes)
- Familiarity with security controls, vulnerability management, and cloud compliance frameworks
- Good communication and incident management skills
Nice to Have:
- Experience in a SaaS product or GRC/compliance/security platform
- Knowledge of Zero Trust architectures and secure cloud networking
- Experience with serverless architectures (Lambda, Step Functions)
- Familiarity with multi-tenant SaaS infrastructure and subdomain-based configurations
- Experience supporting SOC 2/ISO audits from the infrastructure side
Key Responsibilities:
- Infrastructure Management: Design, build, and maintain scalable and secure cloud infrastructure (AWS preferred)
- Reliability & Performance: Monitor system reliability, uptime, and performance metrics. Establish and track SLOs/SLIs/SLAs
- Automation: Develop Infrastructure as Code (IaC) using Terraform, CloudFormation, or AWS CDK, and automate deployment pipelines (CI/CD)
- Incident Management: Lead incident response, root cause analysis, and postmortem documentation
- Observability: Implement and maintain observability tools (e.g., Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
- Security & Compliance: Work with the security team to keep the infrastructure compliant with SOC 2, ISO 27001, and similar frameworks
- Cost Optimization: Monitor cloud costs and optimize infrastructure usage
- Collaboration: Partner with developers to ensure applications are production-ready, fault-tolerant, and meet operational standards