Senior Site Reliability Engineer - Toronto, Ontario
1 day ago

Job description
At BuildOps, we're building a software platform that empowers today's commercial contractors. From service management to project execution, we're reimagining how our customers operate. Our team thrives on ambition, innovation, and collaboration – qualities we look for in every new hire.
You will join our cloud infrastructure and reliability engineering team as a Site Reliability Engineer (SRE). Your primary responsibility will be to improve and protect the reliability, performance, and operability of our production systems while helping evolve our AWS-based infrastructure. We're looking for someone with a strong SRE mindset, solid software engineering fundamentals, and deep observability expertise who can work effectively in a distributed team environment.
Reporting to the DevOps and SRE Manager, this is a hands-on role where you will influence reliability strategy, build tooling and automation, and contribute directly to day-to-day operations in a fast-moving, industry-defining company.
What You'll Do
- Drive and refine modern SRE practices across services, including SLIs/SLOs, error budgets, and reliability reviews
- Design and maintain end-to-end observability (metrics, logs, traces, dashboards, and alerts) so teams can quickly detect, debug, and prevent issues
- Partner with product and engineering teams to design reliable services—reviewing architectures, failure modes, rollout strategies, and capacity/latency considerations
- Help evolve and operate our AWS infrastructure (networking, compute, data stores) using Infrastructure as Code (Terraform)
- Contribute code to services, tooling, and automation (for example, reliability libraries, deployment and incident tooling, health checks)
- Define, implement, and iterate on SLIs, SLOs, and error budgets with service owners, and use them to guide reliability work and release decisions
- Participate in incident response for infrastructure-related production issues, including learning-focused post-incident reviews and follow-through on action items
- Develop runbooks, safeguards, and automation that reduce manual work, improve time-to-diagnosis, and standardize responses to recurring scenarios
- Advocate for and implement security and compliance best practices in production environments
- Document standards, playbooks, and best practices so reliability improvements scale across teams
- Collaborate closely with software engineers, product managers, and other stakeholders to plan and deliver reliability-focused initiatives
What We Look For
- 5+ years of professional experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or production-focused Software Engineering, working on production systems and reliability-focused initiatives
- Proven experience leading multi-sprint, multi-engineer projects (for example, reliability, performance, or infrastructure initiatives) to successful completion with clear business impact
- Thorough understanding of, and hands-on experience with, modern SRE practices, such as:
  - Defining and implementing SLIs/SLOs and error budgets
  - Reducing toil through automation
  - Safe deployment and rollout patterns
  - Structured post-incident reviews and continuous improvement
- Software engineering experience: you've written and maintained production-quality code and can work comfortably in at least one modern language (for example, Python)
- Strong interest in, and experience with, using LLMs and AI-assisted tooling in your workflow, including the ability to validate and improve what they generate
- Strong observability skills, including:
  - Designing metrics, logging, and tracing for multi-service systems
  - Building actionable dashboards and alerts with clear runbooks
  - Correlating metrics, logs, and traces to debug complex issues
  - Experience with tools such as Datadog, Prometheus, Grafana, Honeycomb, or New Relic (we use Datadog, but vendor-agnostic experience is welcome)
- Experience working with AWS in production and with core platform primitives such as Terraform-based Infrastructure as Code and container/orchestration platforms (for example, Docker with ECS, EKS, or Kubernetes)
- Incident management experience is a strong plus, including:
  - Participating in or coordinating incident response
  - Working within an incident management tool (for example, PagerDuty, Opsgenie, or similar)
  - Helping teams implement durable, high-leverage follow-ups
- Demonstrated experience in engineering-led reliability practices (for example, infrastructure as code, automated deployments, observability-driven operations) rather than primarily manual server administration
- Strong communication skills and the ability to explain complex technical topics to both technical and non-technical audiences
- CS degree or equivalent experience running production systems; we are equally interested in people from non-traditional backgrounds who have spent time operating real-world environments
- Ability and willingness to participate in a production on-call rotation
- Ability to work a hybrid schedule – Monday/Friday WFH; Tuesday–Thursday in-office
Compensation
- $151,000 - $190,000 CAD base salary range + annual bonus
What we offer:
- Generous equity grant: become an owner in our company
- MacBook computer provided
- A comprehensive benefits package
- Flexible PTO and hybrid work schedules
- Work from home stipend
- Hubs in Los Angeles, San Francisco, Toronto, and Raleigh with hybrid work schedules and lunch provided for in-office days
- Company events like BBQs and team-building activities, both in-person and virtual
- Fast-paced, collaborative, and dynamic work environment
- Opportunities for growth and career advancement
- Chance to work with cutting-edge technology and innovative solutions
- The chance to get in on the ground floor and build something truly groundbreaking for ourselves and our amazing customers
We welcome applicants from across the U.S. where we are registered to do business and able to support employment. Currently, this excludes the following states: Alabama, Alaska, Connecticut, Hawaii, Kentucky, Mississippi, Nebraska, New Mexico, North Dakota, Rhode Island, South Dakota, West Virginia, and Wyoming. This list is based solely on operational and compliance considerations and is reviewed from time to time as our footprint grows.
About BuildOps
Join BuildOps, the largest commercial trade platform in the country, as we transform the multi-billion dollar commercial contracting industry.
We're not just talking incremental improvements; we're talking a full-scale revolution, empowering the hardworking heroes who build and maintain the infrastructure that keeps our world running.
This is your chance to be part of a rocketship. We're fresh off a $1 billion valuation and a $127M Series C funding round (part of over $275M raised to date) led by industry-leading investors like Meritech Capital, BOND, and SE Ventures, backed by Schneider Electric (Reuters, TechCrunch, LA Business Journal). Our latest investors join our team of industry heavyweights like Next47, former Twitter CEO Dick Costolo, former Salesforce President Gavin Patterson, and Boost Mobile CEO Stephen Stokols. Their investment is fueling our aggressive growth and our commitment to equipping contractors with AI-driven tools to conquer chaos, boost efficiency, skyrocket profitability, and ultimately, deliver exceptional service.
At BuildOps, we're changing the game and doing the best work of our careers. You'll be a key player in a company that's truly making a difference for the backbone of our economy. If you're ready to tackle big challenges, work with a passionate team, and build something extraordinary, BuildOps is the place for you.