Data Engineering Basics

DATA ENGINEERING
BASICS & GETTING STARTED

Data Engineer
 They build and scale the platforms that collect, process and store data for
data science and business analytics.
Data Scientist
 They use linear algebra and multivariable calculus to derive new insights from
existing data.

Designing, building and scaling systems that organize
data for analytics

ETL (EXTRACT, TRANSFORM, LOAD)

Basic architecture of ETL
Scaling factor

Raw data
 Unprocessed data in the format used at the source, e.g. JSON
 No schema applied
Processed data
 Raw data with schema applied
 Stored in event tables/destinations in pipelines
Cooked data
 Processed data that has been summarized.
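The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event fields (user, amount), the sample records, and the function names to_processed and to_cooked are hypothetical and do not come from the slides.

```python
import json
from collections import defaultdict

# Hypothetical raw events, kept in the source's own format (JSON strings).
# No schema is applied yet: every amount is still a string.
RAW_EVENTS = [
    '{"user": "ann", "amount": "12.50"}',
    '{"user": "bob", "amount": "3.00"}',
    '{"user": "ann", "amount": "7.25"}',
]


def to_processed(raw_line):
    """Apply a schema: parse the JSON and cast each field to a fixed type."""
    record = json.loads(raw_line)
    return {"user": str(record["user"]), "amount": float(record["amount"])}


def to_cooked(processed_rows):
    """Summarize processed rows: total amount per user."""
    totals = defaultdict(float)
    for row in processed_rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


processed = [to_processed(line) for line in RAW_EVENTS]  # raw -> processed
cooked = to_cooked(processed)                            # processed -> cooked
print(cooked)
```

In a real pipeline the processed rows would typically land in an event table and the cooked summary in a separate reporting table; here both are plain in-memory Python objects.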

Volume
 How much data you have
Velocity
 How fast data is getting to you
Variety
 How different your data is
Veracity
 How reliable your data is

STREAM PROCESSING
Process data on the fly, as it comes in

At Least Once
At Most Once
Exactly Once
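As a rough illustration of why these guarantees matter: under at-least-once delivery a message can arrive more than once, under at-most-once delivery it can be lost, and an exactly-once effect is often approximated by combining at-least-once delivery with duplicate detection on the consumer side. The sketch below assumes a hypothetical stream of (message_id, value) pairs and hypothetical function names; it does not describe any particular streaming system.

```python
# Hypothetical stream: message 2 is delivered twice, as can happen under
# at-least-once delivery when an acknowledgement is lost and the sender retries.
DELIVERIES = [(1, 10), (2, 20), (2, 20), (3, 30)]  # (message_id, value)


def naive_total(deliveries):
    """Counts every delivery, so a redelivered message is counted twice."""
    return sum(value for _, value in deliveries)


def deduplicated_total(deliveries):
    """Skips message IDs already seen, so each message takes effect once."""
    seen = set()
    total = 0
    for message_id, value in deliveries:
        if message_id in seen:
            continue  # duplicate delivery: ignore it
        seen.add(message_id)
        total += value
    return total


print(naive_total(DELIVERIES))         # duplicate is counted
print(deduplicated_total(DELIVERIES))  # duplicate is ignored
```

The deduplicated version only works if every message carries a stable identifier and the consumer can remember which identifiers it has seen, which is itself a storage cost at scale.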

MAP REDUCE
Key-value pairing.
Organize the data into keys and values,
Sort by the key,
Combine the data with matching keys,
Repeat until you have the final key-value outcome.
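The steps above can be sketched as a single-process word count in Python. This is a minimal illustration, not a distributed implementation: the function names map_phase, shuffle_and_sort and reduce_phase and the sample input are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Organize the data into (key, value) pairs: one (word, 1) per word."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_and_sort(pairs):
    """Sort by key so that pairs with matching keys sit next to each other."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Combine the values of matching keys into one value per key."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


LINES = ["big data", "data pipelines move data"]  # hypothetical input
word_counts = reduce_phase(shuffle_and_sort(map_phase(LINES)))
print(word_counts)
```

In a real cluster the map and reduce phases run in parallel on many machines and the sort step moves data between them; the output of one round can be fed into another round as its input.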

Relational Database (SQL)
Document Store (NoSQL)

The Data Engineering Cookbook
https://github.com/andkret/Cookbook
