Key Responsibilities
CI/CD and Pipeline Engineering
- Design, implement, and maintain fault-tolerant CI/CD pipelines using GitHub Actions and Jenkins, built for high-frequency, zero-downtime releases on mission-critical systems.
- Build and enforce multi-stage pipeline gates from scratch, including automated security scanning (SAST/DAST), linting, test coverage thresholds, and compliance checks, before any promotion to production.
- Implement blue/green and canary deployment strategies to minimize risk in production rollouts, with automated rollback triggers on failure signals.
- Establish and enforce SLAs on pipeline execution time and build reliability, driving continuous improvement cycles with development teams.
Kubernetes and Container Orchestration
- Operate and maintain production-grade Kubernetes clusters across RKE2, K3s, and Nutanix Kubernetes Platform, ensuring 99.9%+ availability with proper multi-zone failover configuration.
- Define and enforce Pod Disruption Budgets, Horizontal Pod Autoscalers, and resource quotas to guarantee service continuity under load spikes and node failures. (Critical requirement)
- Manage etcd backup, cluster upgrades, and node lifecycle with zero service interruption using rolling strategies and scheduled maintenance windows.
- Deploy and manage workloads using Helm charts with environment-specific value overrides, lifecycle hooks, and release management in line with best practices.
- Implement network policies, RBAC, and pod security standards across namespaces to enforce least-privilege access and blast-radius containment in sensitive environments.
- Maintain and regularly test disaster recovery runbooks for cluster-level failures. RTO/RPO targets must be documented, rehearsed, and consistently met.
Observability and Incident Response
- Build and own the full observability stack covering metrics (Prometheus/Grafana), distributed tracing (OpenTelemetry), and structured logging (ELK/SigNoz) across all production services end-to-end. (Strong plus)
- Define, tune, and own alerting rules and SLO/SLI dashboards, reducing MTTD and MTTR through proactive anomaly detection rather than reactive firefighting.
- Lead and participate in on-call rotations for Sev1/Sev2 production incidents and drive blameless post-mortems with documented, actionable follow-through.
- Instrument services with OpenTelemetry SDKs in collaboration with development teams and standardize trace context propagation across microservices. (Strong plus)
- Maintain runbooks and escalation playbooks for all critical failure modes as living documentation, reviewed and tested on a quarterly basis.
Cloud Infrastructure and Security
- Architect and manage AWS-based infrastructure including VPC, EKS, RDS, S3, IAM, and CloudWatch using Infrastructure-as-Code tools such as Terraform, with GitOps-based review cycles.
- Enforce security hardening at every layer: secrets management via Vault or AWS Secrets Manager, TLS everywhere, container image scanning, SBOM generation, and CIS Benchmark compliance. (Critical requirement)
- Design and maintain multi-region or multi-AZ active-passive and active-active topologies with automated failover for both stateful and stateless workloads.
- Manage Git branching strategies, protected branch policies, and code review workflows aligned with change management requirements for regulated or high-sensitivity environments.
- Collaborate with security and compliance teams to meet audit, data residency, and access control requirements applicable to the system’s sensitivity classification.
Required Skills and Qualifications
- 3 to 5 years of experience as a DevOps Engineer, Platform Engineer, or Site Reliability Engineer in production environments.
- Strong hands-on experience designing and operating CI/CD pipelines using GitHub Actions and Jenkins with release gating, rollback automation, and pipeline-as-code practices.
- Practical, production-level experience with Kubernetes including RKE2, K3s, and Nutanix Kubernetes Platform for deployments, scaling, configuration, and incident response.
- Working knowledge of Helm charts for Kubernetes application packaging, deployment, upgrades, and environment-specific configuration management.
- Solid experience with AWS and AWS-based deployment workflows. Infrastructure-as-Code experience with Terraform is required.
- Strong Linux systems administration and command-line proficiency in RHEL or Ubuntu-based environments.
- Good understanding of Git, GitHub, branching strategies, protected branches, and code review workflows suitable for regulated or sensitive systems.
- Experience with container security practices including image scanning, SBOM generation, secrets management, and runtime security controls.
- Familiarity with HA patterns such as blue/green deployments, canary releases, and circuit breaking in production Kubernetes workloads.
- Ability to participate in on-call rotations, lead incident response, and contribute to blameless post-mortems with clear action items.
- Experience with ELK, OpenTelemetry, and SigNoz for full-stack observability and advanced monitoring is a significant advantage.
Please send your CV to vacancy@khalti.com.