AWS Certified Data Engineer – Associate (DEA-C01)

Introduction

What do data engineers do? Data engineers:

  • Process raw data into valuable insights for organizations.
  • Design, develop and maintain data architectures.
  • Build extract, transform, and load (ETL) pipelines so that data analysts, data scientists, and other data consumers can effectively access and analyze the data.

Data management challenges:

  • Data processing silos.
  • Excessive data movement.
  • Data duplication.

Essentially, the role of the data engineer is to get data from sources, make it useful, and serve it to stakeholders.

Data Discovery

  • Define business value
  • Identify your data consumers
  • Identify your data sources
  • Define your storage, catalog and access needs
  • Define your processing needs

Data engineer responsibilities:

  • Build and manage data infrastructure
  • Ingest data from various sources
  • Prepare the ingested data for analytics
  • Catalog and document curated datasets
  • Automate regular data workflows
  • Ensure data quality, security and compliance

AWS Data Services and Modern Infrastructure

Basic workflow of modern data engineering building blocks:

Modern Workflow
  1. Ingest
    • AWS DMS – migrate data between relational databases, NoSQL databases, and other data stores.
    • Amazon Data Firehose – ingest streaming data in near real time; can decompress the data or apply transformations (see the Firehose sketch after this list).
    • Amazon MSK – managed Apache Kafka; an alternative to Data Firehose.
    • AWS IoT Core – connect IoT devices and ingest their telemetry into AWS.
    • AWS DataSync – sync data from on premises or between AWS storage services. Used to transfer Hadoop (HDFS) data into AWS.
    • AWS Transfer Family – enables SFTP, FTPS, and FTP file transfers into and out of Amazon S3 buckets.
    • AWS Snowball – physical transfer devices for when transfer over the network isn’t feasible.
  2. Store
    • Amazon S3
  3. Catalog
    • AWS Glue Data Catalog (see the catalog sketch after this list)
  4. Process
    • AWS Glue
    • Amazon EMR – easily run and scale Apache Spark, Hive, Presto, and other big data workloads.
    • Amazon Managed Service for Apache Flink (formerly: Amazon Kinesis Data Analytics)
  5. Deliver
    • Amazon Redshift – cloud data warehouse; run SQL on large sets of structured data across databases and datasets without moving the data.
    • Amazon Athena – query large datasets directly on Amazon S3 using standard SQL (see the Athena sketch after this list).
    • Amazon EMR – run analytics frameworks like Apache Spark, Hive, Presto, and Flink on large datasets stored in AWS services like Amazon S3 and Amazon DynamoDB.
    • AWS databases – more than fourteen purpose-built databases to store, query, and analyze large datasets, for example Amazon DocumentDB and Amazon Neptune (graph database).
    • Amazon OpenSearch
    • Amazon QuickSight
    • Amazon SageMaker
  6. Security and governance
    • AWS Lake Formation
    • AWS IAM
    • AWS KMS
    • Amazon Macie
    • Amazon DataZone – catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources.
    • AWS Audit Manager – continuously audit AWS usage to assess risk and compliance with regulations and industry standards.
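
The sketches below illustrate a few of these building blocks with boto3 (Python). First, ingestion: a minimal sketch that pushes one JSON record to an Amazon Data Firehose delivery stream, assuming the stream already exists and points at a destination such as S3. The stream name and record fields are hypothetical examples.

    import json

    import boto3

    firehose = boto3.client("firehose")

    record = {"user_id": "u-123", "event": "page_view", "ts": "2024-01-01T00:00:00Z"}

    # PutRecord sends a single record; Firehose buffers it and delivers it
    # to the configured destination (optionally transformed or compressed).
    response = firehose.put_record(
        DeliveryStreamName="clickstream-to-s3",  # hypothetical stream name
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
    print(response["RecordId"])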
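
Next, the catalog step: a minimal sketch that lists table metadata from the AWS Glue Data Catalog. The database name is a hypothetical example; in practice a Glue crawler or an ETL job usually populates these tables.

    import boto3

    glue = boto3.client("glue")

    # GetTables is paginated; walk every page of the (hypothetical) database.
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName="sales_db"):
        for table in page["TableList"]:
            location = table.get("StorageDescriptor", {}).get("Location", "n/a")
            print(f'{table["Name"]}: {location}')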
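
Finally, delivery: a minimal Amazon Athena sketch that runs SQL directly against cataloged data on S3, polls until the asynchronous query finishes, and prints the result rows. The database, table, and output bucket names are hypothetical examples.

    import time

    import boto3

    athena = boto3.client("athena")

    start = athena.start_query_execution(
        QueryString="SELECT event, COUNT(*) AS n FROM clickstream GROUP BY event",
        QueryExecutionContext={"Database": "sales_db"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = start["QueryExecutionId"]

    # Athena runs asynchronously; poll until the query reaches a final state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows:
            print([col.get("VarCharValue") for col in row["Data"]])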

Orchestration and Automation Options

Automation is suitable for simple repetitive tasks. Orchestration is needed for complex workflows involving the coordination of multiple services, teams, and dependencies across stages.

Services used for orchestration and automation:

  • AWS Lambda
  • AWS Step Functions (see the sketch after this list)
  • Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
  • Amazon EventBridge
  • Amazon SNS
  • Amazon SQS
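
As a sketch of orchestration, the snippet below starts an execution of an existing AWS Step Functions state machine, for example one that chains a Glue job, a data quality check, and an SNS notification. The state machine ARN and input payload are hypothetical placeholders.

    import json

    import boto3

    sfn = boto3.client("stepfunctions")

    # Kick off one run of the (hypothetical) ETL state machine with a JSON input.
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:etl-pipeline",
        input=json.dumps({"run_date": "2024-01-01"}),
    )
    print(response["executionArn"])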

Data Engineering Security and Monitoring

Security should address the following five areas:

  • Access management
    • AWS IAM, AWS ACM
  • Regulatory compliance
    • AWS Audit Manager, AWS Config
  • Sensitive data protection
    • Amazon Macie, AWS KMS, AWS Glue (see the encryption sketch after this list)
  • Data and network security
    • AWS Control Tower, Amazon GuardDuty, AWS WAF, AWS Shield
  • Data auditability
    • AWS CloudTrail, AWS Lake Formation, AWS Glue Data Catalog
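
As a sketch of sensitive data protection at rest, the snippet below writes an object to S3 with server-side encryption under a customer-managed AWS KMS key. The bucket name, object key, and key alias are hypothetical examples.

    import boto3

    s3 = boto3.client("s3")

    # SSE-KMS: S3 encrypts the object with the customer-managed key on write.
    s3.put_object(
        Bucket="my-curated-data",               # hypothetical bucket
        Key="pii/customers.csv",
        Body=b"id,email\n1,user@example.com\n",
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/data-lake-key",      # hypothetical key alias
    )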

The following are a few key aspects that should be monitored (a CloudWatch monitoring sketch follows the list):

  • Resources
  • Analytics Jobs
  • Data Pipelines
  • Data Access
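
As a monitoring sketch, the snippet below publishes a custom Amazon CloudWatch metric (rows processed by a pipeline run) that alarms and dashboards can then track. The namespace, metric, and dimension names are hypothetical examples.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Emit one data point; CloudWatch aggregates these per namespace/dimensions.
    cloudwatch.put_metric_data(
        Namespace="DataPipelines",  # hypothetical custom namespace
        MetricData=[
            {
                "MetricName": "RowsProcessed",
                "Dimensions": [{"Name": "Pipeline", "Value": "daily-sales-etl"}],
                "Value": 12345,
                "Unit": "Count",
            }
        ],
    )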