It’s Time for Data Reliability Engineering – The New Stack


Kyle Kirwan

Kyle Kirwan is the CEO and co-founder of Bigeye. Prior to founding Bigeye, he was the first product manager of the metadata team at Uber, where he launched their data catalog product Databook and led the development of their internal lineage, freshness and data quality tools.

There was a time when software was not reliable enough. In a 20-year-old MIT Technology Review article, a software engineer laments that good software “is usable, reliable, flawless, cost effective and maintainable. And software is none of that anymore.” Fast forward two decades, and businesses run on software ranging from payment systems to CRM and everything in between.

As data evolves from a nice-to-have into something companies rely on to create customer experiences and drive revenue, data must undergo a similar evolution, and there are lessons to be learned from the work already done by the software engineering pioneers who came before.

Borrowing from the principles of Site Reliability Engineering, Data Reliability Engineering gives a name to the work of improving data quality, maintaining data timeliness, and ensuring that analytics and machine learning products are fed with a healthy set of inputs.

This work is done by data engineers, data scientists, and analytics engineers who historically haven’t had at their disposal the mature tools and processes that modern software engineering and DevOps teams already enjoy. Thus, data reliability work today typically involves more spot-checking of data, kicking off late-night backfills, and hand-rolling SQL monitoring in Grafana than scalable, repeatable processes such as monitoring and incident management.

Under the name of Data Reliability Engineering (DRE), some data teams are starting to change that by borrowing from SRE and DevOps.

Why is this happening now?

Data quality writ large has been a topic for decades, but it’s gotten a lot more attention in the past two years. This is driven by a few trends coming together at the same time.

  1. Data is being used in ever higher-impact applications: Support chatbots, product recommendations, inventory management, financial planning and more. These data-driven applications promise big efficiencies, but they can also incur costs to the business in the event of a data outage. As companies pursue use cases with higher and higher return on investment, the demand for quality and reliability increases.
  2. Humans are less in the loop: Streaming data, machine learning models that retrain on regular schedules, self-service dashboards, and other applications reduce the number of humans in the loop. This means that pipelines need to be more reliable by default, because there is no longer an analyst or data scientist checking the data on an ad hoc basis, and there shouldn’t be: they have work to do!
  3. There are not enough data engineers for everyone: Hiring data engineers is difficult and expensive. The demand for talent is exploding while the supply of people capable of building and scaling complex data platforms has not kept pace. This puts immense pressure on these teams to be resource-efficient and to avoid reactive firefighting, which raises the value of anything that automates problem detection and resolution, and especially of tools or practices that help prevent problems in the first place.

Where does DRE stop and where does DataOps start?

Data reliability engineering is part of data operations (DataOps), but only part of it. DataOps refers to the larger set of all operational challenges that owners of data platforms will face. These challenges cover issues such as data discovery and governance, cost tracking and management, access controls, and managing an ever-increasing number of queries, dashboards, and ML features and models.

To draw a parallel with DevOps, reliability and availability are certainly challenges that many DevOps teams are responsible for, but they are often also tasked with other concerns such as developer velocity and security.

DRE tools and techniques

While the ink has not dried on the best tools and practices for data reliability engineering, the seven core concepts in Google’s SRE playbook create a solid foundation for data teams to build on.

  1. Embrace risk: It is inevitable that something will eventually fail. Teams should plan to detect, control, and mitigate failures that do occur, rather than hoping that they can one day achieve perfection.
  2. Monitor everything: Problems cannot be controlled and mitigated if they cannot be detected. Monitoring and alerts give teams the visibility they need to understand when something is wrong and how to fix it. Observability tools are a mature area for infrastructure and applications, but for data, it’s still an emerging space.
  3. Set standards: Is the data of high quality or not? It’s a subjective matter that needs to be defined, quantified and agreed upon for teams to move forward. If the definition of good or not good is blurry or misaligned, it will be difficult to improve. SLIs, SLOs and SLAs (service-level indicators, objectives and agreements) are the standardization tools that can be adapted from SRE-land to DRE-land.
  4. Reduce toil: “Toil” is a word that describes the human work required to keep your system running (operational work), as opposed to engineering work that improves the system. Examples: manually kicking off an Airflow job or manually updating a schema. For effective data reliability engineering, it’s worth removing as much toil as possible to reduce overhead. For example, tools like Fivetran can reduce data ingestion toil, and Looker training sessions can reduce the toil of responding to BI requests.
  5. Use automation: The complexity of a data platform can grow exponentially, while managing it manually scales only linearly with headcount, which is expensive and untenable. Automating manual processes helps data teams scale up their reliability efforts, freeing up brainpower and time to address higher-order issues.
  6. Control releases: Making changes is ultimately how things get better, but also how things break. This is an area where data teams can borrow quite directly from SRE and DevOps, using code review and CI/CD pipelines. After all, pipeline code is still code at the end of the day.
  7. Keep it simple: The enemy of reliability is complexity. Complexity can’t be completely eliminated (the pipeline is doing something to the data, after all), but it can be reduced. Minimizing and isolating the complexity of a pipeline job goes a long way toward keeping it reliable.
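
To make the monitoring and standards concepts more concrete, here is a minimal Python sketch of a freshness SLI checked against an SLO. It is only an illustration under stated assumptions: the names FreshnessSLO, freshness_sli and meets_slo, the two-hour lag threshold, the 90% target and the sample timestamps are all hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FreshnessSLO:
    """Hypothetical SLO: the table counts as fresh when its last load is at
    most `max_lag` old, and it should be fresh in at least `target` of checks."""
    max_lag: timedelta
    target: float  # e.g. 0.9 means 90% of checks must find the table fresh


def is_fresh(last_loaded_at: datetime, checked_at: datetime, slo: FreshnessSLO) -> bool:
    """One monitoring check: is the newest load recent enough?"""
    return checked_at - last_loaded_at <= slo.max_lag


def freshness_sli(observations: list[tuple[datetime, datetime]], slo: FreshnessSLO) -> float:
    """SLI: fraction of (last_loaded_at, checked_at) checks that passed."""
    if not observations:
        raise ValueError("need at least one observation to compute an SLI")
    passed = sum(is_fresh(loaded, checked, slo) for loaded, checked in observations)
    return passed / len(observations)


def meets_slo(sli: float, slo: FreshnessSLO) -> bool:
    """Compare the measured SLI with the agreed objective."""
    return sli >= slo.target


# Made-up sample data: hourly checks on one day; the table is loaded at 00:00
# and not reloaded until 03:30, so the 03:00 check finds it stale.
day = datetime(2022, 1, 1)
observations = [
    (day, day + timedelta(hours=1)),                               # 1h old
    (day, day + timedelta(hours=2)),                               # 2h old
    (day, day + timedelta(hours=3)),                               # 3h old, stale
    (day + timedelta(hours=3, minutes=30), day + timedelta(hours=4)),  # 30min old
]
slo = FreshnessSLO(max_lag=timedelta(hours=2), target=0.9)

sli = freshness_sli(observations, slo)
print(f"SLI={sli:.2f}, meets SLO: {meets_slo(sli, slo)}")  # SLI=0.75, meets SLO: False
```

The design choice here is to keep the indicator (what is measured) separate from the objective (what was agreed), so the threshold can be renegotiated with data consumers without touching the measurement code.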

The future of DRE

Data reliability engineering is a very young concept, and many companies are helping to define the tools and practices that will make DRE as effective as SRE and DevOps. If you want to explore the concepts, the Data Reliability Engineering Conference is a good starting point. The first event took place in December with over a thousand attendees and speakers from across the industry, including Looker, dbt, Figma, Datadog and Netflix.

Feature image via Pixabay.

