DATA ENGINEERING
BASICS & GETTING STARTED
DEFINITIONS
Data Engineer
• Builds and scales the platforms that collect, process and store data for data science and business analytics.
Data Scientist
• Uses linear algebra and multivariable calculus to create new insights from existing data.
DATA ENGINEERING
Designing, building and scaling systems that organize data for analytics
ETL (EXTRACT, TRANSFORM, LOAD)
Basic architecture of ETL
Scaling factor
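The three ETL stages can be sketched as three small functions chained together. This is a minimal illustration, not a production pipeline: the field names (`user`, `amount`), the sample records and the in-memory list standing in for a warehouse are all assumptions made for the example.

```python
import json

def extract(lines):
    """Extract: parse each raw JSON line from the source into a dict."""
    return [json.loads(line) for line in lines]

def transform(records):
    """Transform: drop incomplete records and normalise field types."""
    rows = []
    for record in records:
        if "user" in record and "amount" in record:
            rows.append({"user": str(record["user"]),
                         "amount": float(record["amount"])})
    return rows

def load(rows, table):
    """Load: append rows to the destination (a list stands in for a table)."""
    table.extend(rows)
    return len(rows)

# Hypothetical source data: one record is missing the "amount" field.
raw_lines = [
    '{"user": "ann", "amount": "12.5"}',
    '{"user": "bob"}',
    '{"user": "cy", "amount": 3}',
]
warehouse = []
loaded = load(transform(extract(raw_lines)), warehouse)
print(loaded, warehouse)
```

Keeping each stage as its own function makes it possible to scale or replace one stage without touching the others.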
DATA CLASSIFICATION
Raw data
• Unprocessed data in the format used at the source, e.g. JSON
• No schema applied
Processed data
• Raw data with schema applied
• Stored in event tables/destinations in pipelines
Cooked data
• Processed data that has been summarized
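The three data classes can be illustrated by moving one small dataset through each stage. The sensor readings, field names and the choice of an average as the summary are assumptions made for this sketch.

```python
import json
from collections import defaultdict

# Raw: JSON strings exactly as the source emitted them, no schema applied.
raw = [
    '{"sensor": "a", "temp": "21.5"}',
    '{"sensor": "a", "temp": "22.5"}',
    '{"sensor": "b", "temp": "19.0"}',
]

# Processed: the same records with a schema applied (typed fields).
processed = [
    {"sensor": str(d["sensor"]), "temp": float(d["temp"])}
    for d in map(json.loads, raw)
]

def summarize(rows):
    """Cooked: processed rows summarized as the average temperature per sensor."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["sensor"]].append(row["temp"])
    return {sensor: sum(temps) / len(temps) for sensor, temps in groups.items()}

cooked = summarize(processed)
print(cooked)
```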
BIG DATA PROPERTIES
Volume
• How much data you have
Velocity
• How fast data is getting to you
Variety
• How different your data is
Veracity
• How reliable your data is
DATA PROCESSING METHODS
BATCH PROCESSING
STREAM PROCESSING
Process data on the fly, as it comes in
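The difference between the two methods can be shown on the same input: a batch job waits for the whole collection and processes it in one pass, while a stream job updates its result as each event arrives. The function names and the sample events are hypothetical.

```python
def batch_total(events):
    """Batch: the full collection is available before processing starts."""
    return sum(events)

def stream_totals(events):
    """Stream: emit an updated running total after every incoming event."""
    total = 0
    for event in events:
        total += event
        yield total

events = [5, 3, 7]
print(batch_total(events))           # one result, after all data is in
print(list(stream_totals(events)))   # one result per event, on the fly
```

Both approaches end at the same final value; the stream version simply makes intermediate results available earlier.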
STREAMING METHODS
At Least Once
At Most Once
Exactly Once
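The three delivery guarantees differ in how they handle failures. The simulation below is a simplified sketch under stated assumptions: a message or its acknowledgement can be lost only on the first attempt, at-most-once never retries, at-least-once retries until acknowledged, and exactly-once is modelled as at-least-once plus duplicate detection by message id on the consumer. Real streaming systems implement these guarantees in different ways.

```python
def run(messages, mode, lost_sends=(), lost_acks=()):
    """Return the payloads the consumer actually processed.

    lost_sends: ids whose first delivery attempt never reaches the consumer.
    lost_acks:  ids that are processed on the first attempt, but whose
                acknowledgement is lost on the way back to the sender.
    """
    processed = []
    seen_ids = set()  # consulted only in "exactly_once" mode
    for msg_id, payload in messages:
        attempt = 0
        while True:
            attempt += 1
            first = attempt == 1
            delivered = not (first and msg_id in lost_sends)
            if delivered:
                duplicate = mode == "exactly_once" and msg_id in seen_ids
                if not duplicate:
                    processed.append(payload)
                    seen_ids.add(msg_id)
            acked = delivered and not (first and msg_id in lost_acks)
            # At-most-once never retries; the other modes retry until acked.
            if mode == "at_most_once" or acked:
                break
    return processed

messages = [(1, "a"), (2, "b"), (3, "c")]
for mode in ("at_most_once", "at_least_once", "exactly_once"):
    print(mode, run(messages, mode, lost_sends={2}, lost_acks={3}))
```

In this run at-most-once loses message 2, at-least-once processes message 3 twice, and the deduplicating consumer processes each message once.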
PROCESSING FRAMEWORKS
MAPREDUCE
Key-value pairing.
Organize the data into keys and values.
Sort by the key.
Combine the data with matching keys.
Repeat until you have the final key-value outcome.
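The steps above can be sketched with the classic word-count example, run in a single process. This only illustrates the key-value flow; a real MapReduce framework distributes the same phases across many machines. The function names and input lines are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Organize the data into (key, value) pairs: one (word, 1) per word."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Sort by key, then gather the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    """Combine the values of each key into the final key-value outcome."""
    return {key: sum(values) for key, values in grouped}

def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))

print(word_count(["big data", "data pipeline data"]))
```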
DATA STORAGE
Relational Database (SQL)
Document Store (NoSQL)
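The contrast between the two storage models can be sketched with the Python standard library. The relational side uses an in-memory SQLite database with a fixed table schema; the document side is only imitated with a dict of JSON strings, since no specific document database is named here. The table, keys and field names are assumptions for the example.

```python
import json
import sqlite3

# Relational (SQL): every row must fit the declared columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute(
    "INSERT INTO users (id, name, city) VALUES (?, ?, ?)", (1, "Ann", "Oslo")
)
row = conn.execute("SELECT name, city FROM users WHERE id = ?", (1,)).fetchone()

# Document store (NoSQL), imitated: each document carries its own structure,
# so two documents in the same collection may have different fields.
documents = {}
documents["user:1"] = json.dumps({"name": "Ann", "city": "Oslo", "tags": ["admin"]})
documents["user:2"] = json.dumps({"name": "Bob"})
doc = json.loads(documents["user:1"])

print(row, doc)
conn.close()
```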
THANK YOU
REFERENCES
The Data Engineering Cookbook
https://github.com/andkret/Cookbook

Data Engineering Basics