
Senior SRE Engineer (Decentralized systems)

Sig.Network · djinni · Senior · $$$$ · Remote only · Worldwide

Technologies

Bash, CI/CD, DevOps, Docker, GCP, GitHub, GitHub Actions, Grafana, IaC, JavaScript, Kubernetes, Node.js, Prometheus, Python, Rust, Telemetry, Terraform, TypeScript

Description

The Role
We are looking for a Senior Site Reliability Engineer to own the reliability, deployment, and operational excellence of our decentralized network.
You will be responsible for ensuring that our infrastructure and protocol operate reliably in production, working closely with both the developer team and external node operators.

Why join?
Work on a real-world decentralized system
Help define and evolve SRE practices from the ground up
Work in a startup environment with a flexible, open culture that prioritizes autonomy, initiative, and ownership
Solve challenges across heterogeneous, externally managed infrastructure (node operators)
No micromanagement
Remote-first with flexible hours and PTO
All code is open source

Responsibilities
Own deployment pipelines across environments (dev → production)
Build and maintain an observability stack (Prometheus, Grafana, logs, tracing)
Define and manage SLIs / SLOs
Work directly with node operators to debug issues and ensure network uptime and consistency
Participate in and improve on-call rotations and incident response processes
Investigate incidents, lead response, and write clear postmortems
Improve system reliability, scalability, and fault tolerance
Continuously reduce operational toil through automation and system improvements
Collaborate with protocol/backend engineers to improve production readiness

Key Requirements
Experience
4+ years as SRE / DevOps / Infrastructure Engineer
Strong knowledge of:
Kubernetes (production-grade)
Docker/containerization
Terraform (or other IaC tools)
Cloud platforms (preferably GCP)
Observability:
Prometheus
Grafana
Systems:
Monitoring and alerting systems
High-availability systems
Engineering:
Ability to write scripts/tools in JS/TS, Bash, Python (Rust is a plus)
Experience managing production incidents and on-call rotations
Communication:
Strong communication skills (you’ll work with external operators)

Nice to have
Experience with blockchain/decentralized systems
Experience working with validators/node operators
Familiarity with:
OpenTelemetry
CI/CD pipelines (e.g., GitHub Actions)
Experience building internal infrastructure tooling
Background in security and production hardening

Culture / Expectations
Self-driven, proactive, and ownership-oriented
Comfortable operating in ambiguity (startup environment)
Strong bias toward automation over manual processes
Focus on reliability, simplicity, and clarity

Join us in building and operating a decentralized network where reliability depends not only on code, but on how systems behave in the real world.