Book, English, 384 pages, Format (W × H): 183 mm × 229 mm, Weight: 748 g
ISBN: 978-1-394-32541-2
Publisher: Wiley
A hands-on technical and industry roadmap for aspiring data engineers
In Data Engineering for Beginners, big data expert Chisom Nwokwu delivers a beginner-friendly handbook for everyone interested in the fundamentals of data engineering. Whether you're interested in starting a rewarding new career as a data analyst, data engineer, or data scientist, or seeking to expand your skillset in an existing engineering role, Nwokwu offers the technical and industry knowledge you need to succeed.
The book explains:
- Database fundamentals, including relational and NoSQL databases
- Data warehouses and data lakes
- Data pipelines, including batch and stream processing
- Data quality dimensions
- Data security principles, including data encryption
- Data governance principles and frameworks
- Big data and distributed systems concepts
- Data engineering on the cloud
- Essential skills and tools for data engineering interviews and jobs
Data Engineering for Beginners offers an easy-to-read roadmap on a seemingly complicated and intimidating subject. It addresses the topics most likely to cause a beginning data engineer to stumble, clearly explaining key concepts in an accessible way. You'll also find:
- A comprehensive glossary of data engineering terms
- Common and practical career paths in the data engineering industry
- An introduction to key cloud technologies and services you may encounter early in your data engineering career
Perfect for practicing and aspiring data analysts, data scientists, and data engineers, Data Engineering for Beginners is an effective and reliable starting point for learning an in-demand skill. It's a powerful resource for everyone hoping to expand their data engineering skillset and upskill in the big data era.
Further Information & Material
Foreword
Introduction
Chapter 1 Understanding Data
A Brief History of Data
Data in 19,000 bce: The Great Baboon and Abacus
Data in the 1600s: Public Health Statistics
Data in the 1800s: The U.S. Census
Data in the 1900s: The Concept of Storage
Data in the 1990s: Data and the Internet
Types of Data
Structured Data
Unstructured Data
Semi-structured Data
Why Is Data Important?
Healthcare
Supply Chain
Transportation and Logistics
Artificial Intelligence
Data and Information
Summary
Notes
Chapter 2 Introduction to Data Engineering
Data Engineering Explained Using an Oil Refinery Analogy
An Overview of the Data Engineering Life Cycle
Data Storage
Data Ingestion
Data Transformation
Data Serving
Navigating Project Requirements, Engaging Stakeholders, and Delivering Business Value
Requirements Gathering
Understanding Stakeholders
Understanding System Requirements
Delivering Business Value
The Current State of Data Engineering
The Importance of Data Engineering
Summary
Chapter 3 Database Fundamentals
Key Concepts of Databases
Rows
Columns
Schema
Keys
Types of Databases
Relational Databases
NoSQL Databases
Choosing Between Relational and NoSQL Databases
Start With Your Data’s Structure
Think About the Relationships in Your Data
How Fast Do You Need to Move?
How Do You Need to Query Your Data?
Scaling and Performance
Transaction and Strong Consistency Needs
Summary
Chapter 4 SQL Fundamentals
Introduction to SQL
Basic SQL Clauses
Comparison Operators
LIKE Statement
IN Statement
BETWEEN Statement
AND Statement
OR Statement
NOT Statement
IS NULL and IS NOT NULL Statements
Sorting and Limiting
Aggregate Functions
Sum()
Avg()
MAX() and MIN()
Group by
Having
Understanding Joins
Inner Join
Left Join
Right Join
Full Outer Join
Subqueries
Common Table Expressions (CTEs)
Set Operations
Window Functions
Lab: Setting Up SQL Server and Running SQL Queries
Best Practices for Writing Efficient SQL Queries
Summary
Chapter 5 Database Design
Data Modeling
Why Do We Need to Model Data?
Types of Data Modeling
Normalization
Rules of Normalization
Downsides of Normalization
Denormalization
Data Modeling Best Practices
Define the Grain
Normalize Now, Denormalize Later
Choose the Right Data Types
Proper Naming Conventions
Database Optimization
Indexing
Partitioning
Sharding
Views
Summary
Chapter 6 Data Warehouses, Data Lakes, and Data Lakehouses
Data Warehouses
Extract, Transform, and Load (ETL)
Schema Design
Snowflake Schema
Slowly Changing Dimensions
Data Marts
Benefits of a Data Mart
Challenges with Data Marts
Data Lakes
How Do Data Lakes Work?
Challenges of Data Lakes
Data Lakehouse
Features of a Data Lakehouse
Data Lakehouse Architecture
The Key Differences Between a Database, Data Warehouse, Data Lake, and Data Lakehouse
Summary
Chapter 7 Data Pipelines
Batch Pipelines
Components of a Batch Pipeline
ETL Pipelines vs. ELT Pipelines
Stream Pipelines
How Would This Work?
Components of a Streaming Data Pipeline
Lambda Architecture
Components of the Lambda Architecture
Advantages of the Lambda Architecture
Challenges and Trade-offs
Data Orchestration
Directed Acyclic Graphs (DAGs)
Scheduling and Automation
Monitoring
Alerts
Lab: Building an ETL Pipeline and Automating with Apache Airflow
Requirements
Set Up Your Development Environment
Extracting Data from CSV
Transforming the Data
Load the New CSV File into a Postgres Database Instance
Schedule ETL Pipeline with Apache Airflow
Summary
Chapter 8 Data Quality
Bad Data
Dimensions of Data Quality
Accuracy
Completeness
Consistency
Validity
Uniqueness
Timeliness
Accessibility
Relevance
Data Quality Hierarchy
Data Quality Best Practices
Summary
Chapter 9 Data Security
What Is Data Security?
Common Threats to Data Security
Core Principles of Data Security
Confidentiality
Integrity
Availability
Data Encryption
Symmetric Encryption
Asymmetric Encryption
Data Masking
Understanding Network Security
Access Control
Authentication
Authorization
The Principle of Least Privilege
Access Levels
Secrets Management
Data Security and Data Privacy
Summary
Chapter 10 Data Governance
How to Think About Data Governance
Data Governance Framework
Policies
Regulatory Compliance Policy
Data Classification Policy
Data Retention and Disposal Policy
Data Sharing Policy
Processes
Metadata Management
Data Lineage
Incident Management
Master Data Management
Roles in the Data Governance Framework
Data Owner
Data Steward
Data Custodian
Chief Data Officer (CDO)
Data Management and Data Governance
Summary
Chapter 11 Big Data and Distributed Systems
The Five V’s of Big Data
Volume
Velocity
Variety
Veracity
Value
Distributed Systems
Scalability
Fault Tolerance
Reliability
Concurrency
Resource Management
Consistency
Availability
Load Balancing
Latency
Distributed Data Processing
Apache Hadoop
Big Data File Types
Avro
Parquet
Optimized Row Columnar (ORC)
Choosing the File Type
Summary
Chapter 12 Data Engineering on the Cloud
Cloud Computing
On-Premises
Cloud
Making the Right Choice
Core Cloud Concepts
Storage
Compute
Networking
Cloud Service Models
Infrastructure as a Service
Platform as a Service
Software as a Service
Choosing Between IaaS, PaaS, and SaaS
A Hybrid Approach
Cloud Management Models
Serverless
Managed
Self-Managed
Putting It All Together
Cost Optimization
Understanding Cloud Pricing Models
Rightsizing Resources
Smart Job Scheduling
Storage Optimization
Shutting Down Idle Resources
Use Serverless Where Possible
Monitoring and Alerting
Summary
Chapter 13 Building a Career in Data Engineering
Types of Data Engineering Roles
Types of Data Engineers
Platform Data Engineer
Analytics Data Engineer
AI/ML Data Engineers
Landing Your First Data Engineering Role
A Typical Data Engineering Job Description
How to Build a Winning Résumé
Preparing for a Data Engineering Interview
Thinking Like a Data Engineer
Think in Systems
Learn to Prioritize Data Quality
Design for Failure
Balance Business Context with Technical Choices
Optimize for Clarity, Then Speed
Think Beyond the Tool
Master Automation
Summary
Appendix Sample Interview Questions
SQL
Data Modeling
Data Pipelines
Apache Spark
System Design
Data Engineering Glossary
Index




