The only way to prevent data quality issues is by helping data consumers and producers communicate effectively BEFORE breaking changes are deployed.

To do that, we must first acknowledge the reality of modern software engineering:
1. Data producers don’t know who is using their data and for what
2. Data producers don’t want to cause damage to others through their changes
3. Data producers do not want to be slowed down unnecessarily

Next, we must acknowledge the reality of modern data engineering:
1. Data engineers can’t be a part of every conversation for every feature (there are too many)
2. Not every change is a breaking change
3. A significant number of data quality issues CAN be prevented if data engineers are involved in the conversation

What these six points imply is the following: if data producers, data consumers, and data engineers are all made aware that something will break before a change is deployed, data quality issues can be resolved through better communication, without slowing anyone down, while also building more awareness across the engineering organization.

We are not talking about more meaningless alerts. The most essential piece of this puzzle is CONTEXT, communicated at the right time and place.

Data producers: should understand when they are making a breaking change, who they are impacting, and the cost to the business
Data engineers: should understand when a contract is about to be violated, the offending pull request, and the data producer making the change
Data consumers: should understand that their asset is about to be broken, how to plan for the change, or escalate if necessary

The data contract is the technical mechanism that provides this context to each stakeholder in the data supply chain, facilitated through checks in the CI/CD workflow of source systems. These checks can be created by data engineers and data platform teams, just as security teams create similar checks to ensure engineering teams follow best practices!

Data consumers can subscribe to contracts, just as software engineers subscribe to GitHub repositories in order to be informed when something changes. But instead of being alerted on an arbitrary code change in a language they don’t know, they are alerted on breaking changes to metadata that any data practitioner can easily understand.

Data quality CAN be solved, but it won’t happen through better data pipelines or computationally efficient storage. It will happen by aligning the incentives of data producers and consumers through more effective communication. Good luck! #dataengineering
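As a rough illustration of the kind of CI check described in this post, here is a minimal Python sketch that compares a schema proposed in a pull request against a contract and surfaces the impacted subscribers. The contract format, table and column names, and subscriber list are assumptions made for the example, not a specific tool's API.

```python
# Minimal sketch of a contract check that could run in a producer's CI pipeline.
# The contract format, table/column names, and subscriber list are illustrative
# assumptions for this example, not a specific product's API.

CONTRACT = {
    "table": "orders",
    "columns": {"order_id": "bigint", "customer_id": "bigint", "amount": "numeric"},
    "subscribers": ["analytics-team", "finance-dashboards"],
}

def find_breaking_changes(proposed_columns: dict) -> list[str]:
    """Compare the schema proposed in a pull request against the contract."""
    issues = []
    for name, dtype in CONTRACT["columns"].items():
        if name not in proposed_columns:
            issues.append(f"column '{name}' removed")
        elif proposed_columns[name] != dtype:
            issues.append(f"column '{name}' changed type {dtype} -> {proposed_columns[name]}")
    return issues

if __name__ == "__main__":
    # In CI this schema would be extracted from the pull request under review.
    proposed = {"order_id": "bigint", "customer_id": "varchar", "amount": "numeric"}
    problems = find_breaking_changes(proposed)
    if problems:
        # Give producers, consumers, and data engineers the context, then fail the build.
        print("Breaking changes to", CONTRACT["table"], ":", problems)
        print("Impacted consumers:", CONTRACT["subscribers"])
        raise SystemExit(1)
```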
Best Practices in Data Engineering
Explore top LinkedIn content from expert professionals.
Summary
Mastering best practices in data engineering is crucial for creating reliable, scalable systems that ensure data quality, efficiency, and collaboration among stakeholders. These practices prioritize clear communication, simple designs, and robust testing, so that data pipelines can be managed as software and adapt to evolving needs.
- Focus on stakeholder communication: Establish clear data contracts and align data producers, consumers, and engineers to prevent data quality issues and reduce disruptions across the organization.
- Apply modular design principles: Write clean, reusable, and testable code by creating small, single-purpose functions, ensuring idempotent operations, and separating business logic from data processing.
- Adopt proactive quality measures: Integrate continuous testing, monitoring, and documentation into workflows to catch errors early and maintain data integrity throughout your pipelines.
-
After 1000s of hours building data pipelines over 10 years, I'll teach you functional design principles in 5 minutes. Most data engineers write code that breaks in production. Here's why functional design will save your pipelines:

1. Write atomic functions
Keep functions focused on one task.
• Create single-purpose functions for each operation
• Avoid mixing database connections with data processing
• Split complex operations into smaller, testable units

2. Ensure idempotent operations
Same inputs produce same outputs.
• Use UPSERT instead of INSERT statements
• Design functions that can run multiple times safely
• Prevent duplicate data creation on re-runs
Idempotency prevents data corruption on retries.

3. Eliminate side effects
Functions shouldn't modify external state.
• Pass database connections as function parameters
• Avoid closing connections inside processing functions
• Return outputs instead of modifying global variables
Pure functions are easier to test and debug.

4. Implement dependency injection
Accept external dependencies as inputs.
• Pass database connections to load functions
• Inject configuration objects instead of hardcoding
• Use factory patterns for creating connections

5. Apply referential transparency
Function behavior depends only on inputs.
• Avoid reading from global state inside functions
• Make all dependencies explicit through parameters
• Ensure functions return consistent results

6. Use pure transformation logic
Transform data without external dependencies.
• Separate business logic from infrastructure code
• Create transformation functions that only process data
• Avoid API calls inside transformation functions
Pure transformations are the easiest to unit test.

7. Design composable functions
Build complex operations from simple parts.
• Create small functions that work together
• Use function composition for data pipelines
• Build higher-order functions for common patterns
Composable functions reduce code duplication and improve maintainability.

8. Handle errors functionally
Return errors instead of throwing exceptions.
• Use result types to handle success and failure
• Return None or error objects for invalid inputs
• Let calling code decide how to handle failures
Functional error handling makes pipelines more robust.

---

Share this with your network if it helped you build better data pipelines. How do you handle functional design in Python? Share your approach in the comments below. Follow me for more actionable content.

#dataengineering #python #functionalprogramming #datapipelines #softwareengineering #coding
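To make a few of these principles concrete, here is a short sketch in Python showing a pure transformation, dependency injection of the connection, a composable pipeline, and an idempotent UPSERT. SQLite, the table, and the column names are assumptions made for the example, not a prescribed stack.

```python
# Sketch of several principles above: pure transformation, injected connection,
# idempotent UPSERT, and a composable pipeline. The sqlite example, table, and
# column names are illustrative assumptions.
import sqlite3

def transform(rows: list[dict]) -> list[dict]:
    """Pure transformation: depends only on its input, no I/O, no global state."""
    return [
        {"user_id": r["user_id"], "amount_usd": round(r["amount_cents"] / 100, 2)}
        for r in rows
        if r.get("amount_cents") is not None  # skip invalid rows instead of raising
    ]

def load(conn: sqlite3.Connection, rows: list[dict]) -> int:
    """Load with an injected connection and an idempotent UPSERT, safe to re-run."""
    conn.executemany(
        "INSERT INTO payments (user_id, amount_usd) VALUES (:user_id, :amount_usd) "
        "ON CONFLICT(user_id) DO UPDATE SET amount_usd = excluded.amount_usd",
        rows,
    )
    conn.commit()
    return len(rows)

def run_pipeline(conn: sqlite3.Connection, raw: list[dict]) -> int:
    """Composable pipeline built from small, single-purpose functions."""
    return load(conn, transform(raw))

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (user_id INTEGER PRIMARY KEY, amount_usd REAL)")
    raw = [{"user_id": 1, "amount_cents": 1999}, {"user_id": 2, "amount_cents": None}]
    print(run_pipeline(conn, raw), "row(s) loaded")  # re-running leaves the same state
```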
-
I am a Senior Data Engineer at Amazon with 11+ years of experience. Here are 5 pieces of advice I would give to people in their 20s who want to make a career in Big Data in 2025:

◄ Stop obsessing over fancy tools [ Master SQL first ]
- Become fluent at writing complex joins, window functions, and optimizing queries.
- Deeply understand ETL pipelines: know exactly how data moves, transforms, and lands in your warehouse.
- Practice schema design by modeling real datasets (think e-commerce or user analytics data).

◄ Get hands-on with cloud, not just theory
- Don't just pass AWS certification exams; build projects like a data pipeline from S3 to Redshift or an automated ETL workflow using AWS Glue.
- Learn Kafka by setting up a simple real-time data streaming pipeline yourself.
- Set up an end-to-end analytics stack: ingest real-time data, process it with Airflow and Kafka, and visualize with QuickSight or Power BI.

◄ System Design is your secret weapon
- Don't memorize patterns blindly; sketch systems like a Netflix-like pipeline, complete with partitioning and indexing choices.
- Practice explaining your design to someone non-technical; if you can’t, redesign it simpler.
- Understand real trade-offs, like when to pick NoSQL (DynamoDB) vs SQL (Postgres), with real-world reasons (transaction speed vs consistency).

◄ Machine learning isn't optional anymore
- Go beyond theory: integrate real ML models into your pipelines using something like Databricks or SageMaker.
- Experiment with ML-based anomaly detection; build a basic fraud detection pipeline using real public datasets.
- Know the basics of feature engineering and prepare the datasets used by data scientists; don’t wait for them to teach you.

◄ Soft skills will accelerate your career
- Learn to clearly communicate business impact, not just tech specs. Don’t say "latency reduced," say “users see pages load 2x faster.”
- Document like your future self depends on it: clearly explain your pipelines, edge cases, and design decisions.
- Speak up early in meetings; your solutions won’t matter if no one understands them or knows you created them.

–
P.S. I'm Shubham - a senior data engineer at Amazon. Follow me for more insights on data engineering. Repost if you learned something new today!
-
ETL Testing: Ensuring Data Integrity in the Big Data Era

Let's explore the critical types of ETL testing and why they matter:

1️⃣ Production Validation Testing
• What: Verifies ETL process accuracy in the production environment
• Why: Catches real-world discrepancies that may not appear in staging
• How: Compares source and target data, often using automated scripts
• Pro Tip: Implement continuous monitoring for early error detection

2️⃣ Source to Target Count Testing
• What: Ensures all records are accounted for during the ETL process
• Why: Prevents data loss and identifies extraction or loading issues
• How: Compares record counts between source and target systems
• Key Metric: Aim for 100% match in record counts

3️⃣ Data Transformation Testing
• What: Verifies correct application of business rules and data transformations
• Why: Ensures data quality and prevents incorrect analysis downstream
• How: Compares transformed data against expected results
• Challenge: Requires deep understanding of business logic and data domain

4️⃣ Referential Integrity Testing
• What: Checks relationships between different data entities
• Why: Maintains data consistency and prevents orphaned records
• How: Verifies foreign key relationships and data dependencies
• Impact: Critical for maintaining a coherent data model in the target system

5️⃣ Integration Testing
• What: Ensures all ETL components work together seamlessly
• Why: Prevents system-wide failures and data inconsistencies
• How: Tests the entire ETL pipeline as a unified process
• Best Practice: Implement automated integration tests in your CI/CD pipeline

6️⃣ Performance Testing
• What: Validates ETL process meets efficiency and scalability requirements
• Why: Ensures timely data availability and system stability
• How: Measures processing time, resource utilization, and scalability
• Key Metrics: Data throughput, processing time, resource consumption

Advancing Your ETL Testing Strategy:
1. Shift-Left Approach: Integrate testing earlier in the development cycle
2. Data Quality Metrics: Establish KPIs for data accuracy, completeness, and consistency
3. Synthetic Data Generation: Create comprehensive test datasets that cover edge cases
4. Continuous Testing: Implement automated testing as part of your data pipeline
5. Error Handling: Develop robust error handling and logging mechanisms
6. Version Control: Apply version control to your ETL tests, just like your code

The Future of ETL Testing:
As we move towards real-time data processing and AI-driven analytics, ETL testing is evolving. Expect to see:
• AI-assisted test case generation
• Predictive analytics for identifying potential data quality issues
• Blockchain for immutable audit trails in ETL processes
• Increased focus on data privacy and compliance testing
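As a small illustration of point 2 (source to target count testing), here is a minimal sketch of an automated count check. SQLite stands in for the source and target systems, and the table names are placeholders to adapt to your own environment.

```python
# Minimal sketch of a source-to-target count test. SQLite stands in for the
# source and target systems, and the table names are placeholders.
import sqlite3

def row_count(conn: sqlite3.Connection, table: str) -> int:
    # Table names come from a fixed list of known tables, never from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def test_source_to_target_count(source_conn, target_conn) -> None:
    source = row_count(source_conn, "staging_orders")
    target = row_count(target_conn, "dw_orders")
    # Aim for a 100% match; any gap points to an extraction or loading issue.
    assert source == target, f"count mismatch: source={source}, target={target}"

if __name__ == "__main__":
    src, tgt = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    src.execute("CREATE TABLE staging_orders (id INTEGER)")
    tgt.execute("CREATE TABLE dw_orders (id INTEGER)")
    src.executemany("INSERT INTO staging_orders VALUES (?)", [(1,), (2,), (3,)])
    tgt.executemany("INSERT INTO dw_orders VALUES (?)", [(1,), (2,), (3,)])
    test_source_to_target_count(src, tgt)
    print("source and target counts match")
```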
-
Ensuring data quality at scale is crucial for developing trustworthy products and making informed decisions. In this tech blog, the Glassdoor engineering team shares how they tackled this challenge by shifting from a reactive to a proactive data quality strategy.

At the core of their approach is a mindset shift: instead of waiting for issues to surface downstream, they built systems to catch them earlier in the lifecycle. This includes introducing data contracts to align producers and consumers, integrating static code analysis into continuous integration and delivery (CI/CD) workflows, and even fine-tuning large language models to flag business logic issues that schema checks might miss.

The blog also highlights how Glassdoor distinguishes between hard and soft checks, deciding which anomalies should block pipelines and which should raise visibility. They adapted the concept of blue-green deployments to their data pipelines by staging data in a controlled environment before promoting it to production. To round it out, their anomaly detection platform uses robust statistical models to identify outliers in both business metrics and infrastructure health.

Glassdoor’s approach is a strong example of what it means to treat data as a product: building reliable, scalable systems and making quality a shared responsibility across the organization.

#DataScience #MachineLearning #Analytics #DataEngineering #DataQuality #BigData #MLOps #SnacksWeeklyonDataScience

– – –
Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts:
-- Spotify: https://lnkd.in/gKgaMvbh
-- Apple Podcast: https://lnkd.in/gj6aPBBY
-- Youtube: https://lnkd.in/gcwPeBmR
https://lnkd.in/gUwKZJwN
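The hard-versus-soft distinction is easy to express in code. The sketch below is a toy illustration of that idea only, not Glassdoor's actual implementation; the check names and thresholds are invented for the example.

```python
# Toy illustration of the hard-vs-soft check idea described above; this is not
# Glassdoor's implementation, just one way to express the distinction.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    fn: Callable[[list[dict]], bool]
    blocking: bool  # hard checks block promotion; soft checks only raise visibility

def run_checks(staged_rows: list[dict], checks: list[Check]) -> bool:
    promote = True
    for check in checks:
        if check.fn(staged_rows):
            continue
        if check.blocking:
            print(f"HARD check failed: {check.name} -- blocking promotion to production")
            promote = False
        else:
            print(f"SOFT check failed: {check.name} -- alerting, but not blocking")
    return promote

checks = [
    Check("no null primary keys", lambda rows: all(r.get("id") is not None for r in rows), blocking=True),
    Check("row volume within expected range", lambda rows: 800 <= len(rows) <= 1200, blocking=False),
]

if run_checks([{"id": i} for i in range(1000)], checks):
    print("staged data promoted to production")  # the blue-green-style promotion step
```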
-
🚨 Data professionals NEED to utilize software engineering best practices.

Gone are the days of a scrappy Jupyter notebook or quick SQL queries to get stuff done.

👀 While both of those scrappy methods serve a purpose, the reality is that our industry as a whole has matured: data is no longer a means to an end, but actual products with dependencies for critical business processes.

👇🏽 What does this look like?
- Clean code where each action is encapsulated in a specific function and/or class.
- Version control where each pull request has a discrete purpose (compared to 1k+ line PRs).
- Clear documentation of business logic and reasoning (on my SQL queries I would leave comments with Slack message links to show when public decisions were made).
- Unit tests that test your above functions as well as the data when possible.
- CI/CD on your pull requests, which is very approachable now with GitHub Actions.

💻 In my LinkedIn Learning course I was adamant about not just teaching you dbt, but how to create a dbt project that is production ready and utilizes engineering best practices. Specifically, it's hands-on; you will learn:
- How to use the command line
- How to set up databases
- Utilizing requirements.txt files for reproducibility
- Creating discrete PRs for building out your project
- Documentation as code
- Utilizing the DRY principle (don't repeat yourself)
- Implementing tests on your code
- Creating a dev and prod environment
- Setting up GitHub Actions workflows (CI/CD)

🔗 Link to the course in the comments!
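For a sense of scale, the "unit tests" and "CI/CD" bullets can start as small as this: one single-purpose function plus a pytest-style test that a GitHub Actions workflow could run on every pull request. The function and test names are illustrative, not taken from the course.

```python
# A tiny example of encapsulated business logic plus a unit test that CI can
# run on every pull request. Names are made up for illustration.

def normalize_email(raw: str) -> str:
    """Business logic lives in one small, documented, testable function."""
    return raw.strip().lower()

def test_normalize_email():
    assert normalize_email("  Jane.Doe@Example.COM ") == "jane.doe@example.com"
    assert normalize_email("a@b.co") == "a@b.co"

if __name__ == "__main__":
    # Locally or in CI, `pytest` discovers test_* functions; this fallback runs without it.
    test_normalize_email()
    print("all tests passed")
```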
-
Want to master data engineering?

Navigating data engineering can be daunting, but with the right roadmap, it becomes a structured and achievable goal. Check out this comprehensive roadmap which breaks down the essential steps to becoming a proficient data engineer.

☑ Learn Programming: Start with foundational languages like SQL, Python, Java, and Scala. These skills are crucial for writing efficient data manipulation and processing scripts
☑ Processing: Understand batch and stream processing using tools like Spark, Hadoop, Flink, and Kafka. This knowledge is key to handling large-scale data workflows
☑ Databases: Get familiar with both SQL databases (MySQL, Postgres, Oracle) and NoSQL databases (MongoDB, Cassandra, Redis). Each has its unique advantages for different data scenarios
☑ Message Queue: Learn about message queues like Kafka and RabbitMQ to handle real-time data streams and communication between services
☑ Warehouse: Dive into data warehousing solutions such as Snowflake, Hive, Redshift, Synapse, and BigQuery to manage and analyze large datasets effectively
☑ Cloud Computing: Gain expertise in cloud platforms like AWS, Azure, and GCP, which are essential for scalable and flexible data infrastructure
☑ Storage: Explore different storage solutions like HDFS, S3, ADLS, and GCS. Understanding storage options is critical for efficient data management
☑ Data Lake: Learn about data lakes with tools like Databricks and Snowflake to store vast amounts of raw data in its native format
☑ Orchestration: Master orchestration tools like Airflow and Data Factory to automate and manage complex data workflows
☑ Resource Manager: Familiarize yourself with resource managers like YARN and Mesos to efficiently allocate resources in a distributed computing environment

This roadmap is your guide to systematically building a solid foundation in data engineering. Each step is designed to equip you with the skills needed to handle data effectively and derive valuable insights.

Credits: Rocky Bhatia

#data #dataengineering
-
The best way to improve technically as a data engineer / analyst is to treat your pipelines like software.

It's almost always better to keep things as simple as possible. You'll spend more time maintaining and modifying old models than writing new ones. Also, when you hire new folks, they need to understand the existing code base to be productive. If you keep your SQL and Python code simple, it will be easier to read and change.

Here are some tips to keep your data pipelines simple.

1. Create the right abstractions. It all starts with a good data model. Figure out what data you have and what questions you need to answer. Then design the tables that you need.

2. Keep tables near the root (beginning of the pipeline) join free and incremental. Usually the largest data comes in at the beginning as time series events. These should be efficient and easy to maintain. Adding complex joins too early will bog down your whole pipeline.

3. Only join data when absolutely necessary and as far down as possible. Many times, dimensional data changes, like the name or address of an account. If you join it too early, you'll need to constantly backfill large parts of your pipeline. Saving it until the end allows you to backfill only what is necessary.

4. Don't be afraid to refactor your pipelines. Business requirements change, and trying to adapt pipelines to do things they weren't designed to do is how tech debt is created. When things change significantly, so should your pipelines.

5. Unit tests. Your data code should have the same tests as your software: automatic checks that a fixed input produces a fixed output. This is NOT the same thing as a data quality test (checking for nulls / uniqueness). Having unit tests enables you to refactor your pipelines without fear of breaking things or messing up complex business logic.

If you never spend the time to improve your data pipelines, they'll inevitably become messy and unmaintainable. Businesses are not static, so why should your data pipelines be?
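Here is a small sketch of tips 2 and 3: keep the root of the pipeline join-free and incremental, and attach dimension data only in the final step. The event and account structures are made up for illustration.

```python
# Sketch of a join-free, incremental root model with the dimension join saved
# for the final step. Event and account shapes are invented for the example.

def incremental_events(existing: list[dict], new_batch: list[dict]) -> list[dict]:
    """Root model: append-only time-series events, join-free and cheap to re-run."""
    seen = {e["event_id"] for e in existing}
    return existing + [e for e in new_batch if e["event_id"] not in seen]

def final_report(events: list[dict], accounts: dict[int, str]) -> list[dict]:
    """Join dimension data (account names) only at the end, so a renamed account
    means re-running this step rather than backfilling the whole pipeline."""
    return [{**e, "account_name": accounts.get(e["account_id"], "unknown")} for e in events]

if __name__ == "__main__":
    events = incremental_events([], [{"event_id": 1, "account_id": 10, "amount": 5}])
    events = incremental_events(events, [{"event_id": 1, "account_id": 10, "amount": 5}])  # safe re-run
    print(final_report(events, {10: "Acme Corp"}))
```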
-
Learning technical things beyond data pipelines will make you a better data engineer!

Live servers have high quality requirements! If a server goes down, your business dies. If a data pipeline is delayed, an analyst is sad. Learning to deal with higher-stakes technical requirements will help you see how to build higher quality data pipelines!

Higher quality meaning:

- Tested in CI/CD
You should have unit and integration tests for your queries so you don’t push a bad change to your pipeline.

- Monitored in production
Is your pipeline telemetry changing? Is skew hurting the performance? Can you make things more efficient?

- Documented for other engineers
How do you troubleshoot when things break? Who do you talk to when quality errors arise?

You’d be surprised how much full-stack engineering made me a better data engineer. There aren’t enough data engineers who care about this stuff, which leads to the perception that data engineers are less technical than software engineers!

#dataengineering
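One possible starting point for the "monitored in production" bullet is a minimal telemetry wrapper that records step duration and row counts; the metric names and logging setup below are assumptions, not any particular platform's convention.

```python
# Minimal sketch of pipeline telemetry: log duration and row counts per step so
# changes in behaviour are visible. Metric names are illustrative assumptions.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_with_telemetry(step_name, fn, *args, **kwargs):
    """Run one pipeline step and emit basic telemetry about it."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    duration = time.monotonic() - start
    rows = len(result) if hasattr(result, "__len__") else None
    log.info("step=%s duration_s=%.2f rows=%s", step_name, duration, rows)
    return result

if __name__ == "__main__":
    dedupe = lambda rows: list({r["id"]: r for r in rows}.values())
    run_with_telemetry("dedupe", dedupe, [{"id": 1}, {"id": 1}, {"id": 2}])
```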
-
🚨 Data Engineering is Evolving—ETL is Dying 🚨

Let’s face it: traditional ETL (Extract, Transform, Load) wasn’t built for modern data engineering. Here’s why:
⚙️ Data pipelines now handle millions of events per second.
🌍 Sources are more diverse—streaming apps, APIs, IoT, cloud platforms.
⏳ Business leaders want insights in real time—not tomorrow.

The old ETL approach? Too rigid, too slow, and too brittle.

What’s replacing it? Purpose-built solutions designed for scale and agility.
1️⃣ ELT Over ETL: Extract and Load everything first, then transform when needed using tools like dbt for modular workflows.
2️⃣ Purpose-Built Storage Layers: Data warehouses (Snowflake, BigQuery) for structured data. Data lakes (S3, Delta Lake) for large-scale, unstructured data.
3️⃣ Event-Driven Architectures: Tools like Kafka and Pulsar ensure real-time data streams flow across distributed systems.
4️⃣ Self-Serve Data Platforms: Empower analysts with raw data and reusable transformations—cutting out bottlenecks and freeing up engineers for high-value work.

The shift is clear: modern data engineering isn’t about building monolithic pipelines—it’s about building modular, purpose-built systems that fit the problem.

🚫 No more waiting for overnight batch jobs.
🚫 No more one-size-fits-all transformations.

If your pipelines aren’t built to adapt in real time, you’re already falling behind.

💡 Question for you: What purpose-built tools have transformed your data stack? Let’s swap ideas below 👇

#DataEngineering #ETL #ELT #DataPipelines #PurposeBuiltSolutions #ModernDataStack
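To make the ELT idea concrete, here is a rough sketch where records are landed raw first and transformed inside the warehouse only when needed. SQLite stands in for the warehouse, and the table names are invented for the example (in practice the transform step would typically be a dbt model written in SQL).

```python
# Rough sketch of ELT: land raw records first, transform inside the warehouse
# on demand. SQLite stands in for the warehouse; table names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (event_day TEXT, payload TEXT)")   # Load target: raw, untyped
conn.execute("CREATE TABLE daily_orders (day TEXT, orders INTEGER)")     # Transform target

def load_raw(events: list[dict]) -> None:
    """Extract + Load: append raw records without reshaping them first."""
    conn.executemany(
        "INSERT INTO raw_events (event_day, payload) VALUES (:event_day, :payload)",
        events,
    )

def transform_when_needed() -> None:
    """Transform in the warehouse, on demand (a dbt model would express this as SQL)."""
    conn.execute("DELETE FROM daily_orders")
    conn.execute(
        "INSERT INTO daily_orders (day, orders) "
        "SELECT event_day, COUNT(*) FROM raw_events GROUP BY event_day"
    )

load_raw([{"event_day": "2024-01-01", "payload": "{...}"},
          {"event_day": "2024-01-01", "payload": "{...}"}])
transform_when_needed()
print(conn.execute("SELECT * FROM daily_orders").fetchall())  # [('2024-01-01', 2)]
```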