Key Responsibilities

CI/CD and Pipeline Engineering

  • Design, implement, and maintain fault-tolerant CI/CD pipelines using GitHub Actions and Jenkins, built for high-frequency, zero-downtime releases on mission-critical systems.
  • Build and enforce multi-stage pipeline gates from scratch, including automated security scanning (SAST/DAST), linting, test coverage thresholds, and compliance checks, before any promotion to production.
  • Implement blue/green and canary deployment strategies to minimize risk in production rollouts, with automated rollback triggers on failure signals.
  • Establish and enforce SLAs on pipeline execution time and build reliability, driving continuous improvement cycles with development teams. 

 

Kubernetes and Container Orchestration 

  • Operate and maintain production-grade Kubernetes clusters across RKE2, K3s, and Nutanix Kubernetes Platform, ensuring 99.9%+ availability with proper multi-zone failover configuration.
  • Define and enforce Pod Disruption Budgets, Horizontal Pod Autoscalers, and resource quotas to guarantee service continuity under load spikes and node failures. (Critical requirement) 
  • Manage etcd backup, cluster upgrades, and node lifecycle with zero service interruption using rolling strategies and scheduled maintenance windows. 
  • Deploy and manage workloads using Helm charts with environment-specific value overrides, lifecycle hooks, and release management that follows best practices.
  • Implement network policies, RBAC, and pod security standards across namespaces to enforce least-privilege access and blast-radius containment in sensitive environments. 
  • Maintain and regularly test disaster recovery runbooks for cluster-level failures. RTO/RPO targets must be documented, rehearsed, and consistently met. 

 

Observability and Incident Response 

  • Build and own the full observability stack covering metrics (Prometheus/Grafana), distributed tracing (OpenTelemetry), and structured logging (ELK/SigNoz) across all production services end-to-end. (Strong plus)
  • Define, tune, and own alerting rules and SLO/SLI dashboards, reducing MTTD and MTTR through proactive anomaly detection rather than reactive firefighting. 
  • Lead and participate in on-call rotations for Sev1/Sev2 production incidents and drive blameless post-mortems with documented, actionable follow-through.
  • Instrument services with OpenTelemetry SDKs in collaboration with development teams and standardize trace context propagation across microservices. (Strong plus)
  • Maintain runbooks and escalation playbooks for all critical failure modes as living documentation, reviewed and tested on a quarterly basis. 

 

Cloud Infrastructure and Security 

  • Architect and manage AWS-based infrastructure including VPC, EKS, RDS, S3, IAM, and CloudWatch using Infrastructure-as-Code tools such as Terraform, with GitOps-based review cycles.
  • Enforce security hardening at every layer: secrets management via Vault or AWS Secrets Manager, TLS everywhere, container image scanning, SBOM generation, and CIS Benchmark compliance. (Critical requirement)
  • Design and maintain multi-region or multi-AZ active-passive and active-active topologies with automated failover for both stateful and stateless workloads.
  • Manage Git branching strategies, protected branch policies, and code review workflows aligned with change management requirements for regulated or high-sensitivity environments.
  • Collaborate with security and compliance teams to meet audit, data residency, and access control requirements applicable to the system’s sensitivity classification. 

 

Required Skills and Qualifications 

  • 3 to 5 years of experience as a DevOps Engineer, Platform Engineer, or Site Reliability Engineer in production environments.
  • Strong hands-on experience designing and operating CI/CD pipelines using GitHub Actions and Jenkins with release gating, rollback automation, and pipeline-as-code practices.
  • Practical, production-level experience with Kubernetes including RKE2, K3s, and Nutanix Kubernetes Platform for deployments, scaling, configuration, and incident response.
  • Working knowledge of Helm charts for Kubernetes application packaging, deployment, upgrades, and environment-specific configuration management.
  • Solid experience with AWS and AWS-based deployment workflows. Infrastructure-as-Code experience with Terraform is required.
  • Strong Linux systems administration and command-line proficiency in RHEL or Ubuntu-based environments.
  • Good understanding of Git, GitHub, branching strategies, protected branches, and code review workflows suitable for regulated or sensitive systems.
  • Experience with container security practices including image scanning, SBOM generation, secrets management, and runtime security controls.
  • Familiarity with HA patterns such as blue/green deployments, canary releases, and circuit breaking in production Kubernetes workloads.
  • Ability to participate in on-call rotations, lead incident response, and contribute to blameless post-mortems with clear action items.
  • Experience with ELK, OpenTelemetry, and SigNoz for full-stack observability and advanced monitoring is a significant advantage.

 

Please send your CV to vacancy@khalti.com.
