AWS Certified Data Engineer – Associate (DEA-C01)

Introduction

What do data engineers do? Data engineers:

  • Process raw data into valuable insights for organizations.
  • Design, develop and maintain data architectures.
  • Build extract, transform, and load (ETL) pipelines so that data analysts, data scientists, and other data consumers can effectively access and analyze the data.

Data management challenges:

  • Data processing silos.
  • Excessive data movement.
  • Data duplication.

Essentially, the role of the data engineer is to get data from sources, make it useful, and serve it to stakeholders.

Data Discovery

  • Define business value
  • Identify your data consumers
  • Identify your data sources
  • Define your storage, catalog and access needs
  • Define your processing needs

Data engineer responsibilities:

  • Build and manage data infrastructure
  • Ingest data from various sources
  • Prepare the ingested data for analytics
  • Catalog and document curated datasets
  • Automate regular data workflows
  • Ensure data quality, security and compliance

AWS Data Services and Modern Infrastructure

Basic workflow of modern data engineering building blocks:

Modern Workflow
  1. Ingest
    • AWS DMS – migrate data between relational databases, NoSQL databases, and other data stores.
    • Amazon Data Firehose – ingest streaming data in near real time; can decompress the data or apply transformations (see the Firehose sketch after this list).
    • Amazon MSK – managed Apache Kafka; an alternative to Data Firehose.
    • AWS IoT Core – connect IoT devices and ingest their telemetry into AWS.
    • AWS DataSync – sync data from on premises or between AWS storage services. Used to transfer Hadoop (HDFS) data into AWS.
    • AWS Transfer Family – enables SFTP, FTPS, and FTP file transfers into and out of Amazon S3 buckets.
    • AWS Snowball – physical transfer devices for when transfer over the network isn’t feasible.
  2. Store
    • Amazon S3
  3. Catalog
    • AWS Glue Data Catalog (see the catalog sketch after this list)
  4. Process
    • AWS Glue
    • Amazon EMR – easily run and scale Apache Spark, Hive, Presto, and other big data workloads.
    • Amazon Managed Service for Apache Flink (formerly: Amazon Kinesis Data Analytics)
  5. Deliver
    • Amazon Redshift – cloud data warehouse; run SQL on large sets of structured data across databases and datasets without moving the data.
    • Amazon Athena – query large datasets directly on Amazon S3 using standard SQL (see the Athena sketch after this list).
    • Amazon EMR – run analytics frameworks like Apache Spark, Hive, Presto, and Flink on large datasets stored in AWS services like Amazon S3 and Amazon DynamoDB.
    • AWS databases – more than fourteen purpose-built databases to store, query, and analyze large datasets, for example Amazon DocumentDB and Amazon Neptune (graph database).
    • Amazon OpenSearch
    • Amazon QuickSight
    • Amazon SageMaker
  6. Security and governance
    • AWS Lake Formation
    • AWS IAM
    • AWS KMS
    • Amazon Macie
    • Amazon DataZone – catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources.
    • AWS Audit Manager – continuously audit AWS usage to assess risk and compliance with regulations and industry standards.
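
The sketches below illustrate a few of these building blocks with boto3 (Python). First, ingestion: a minimal sketch that pushes one JSON record to an Amazon Data Firehose delivery stream, assuming the stream already exists and points at a destination such as S3. The stream name and record fields are hypothetical examples.

    import json

    import boto3

    firehose = boto3.client("firehose")

    record = {"user_id": "u-123", "event": "page_view", "ts": "2024-01-01T00:00:00Z"}

    # PutRecord sends a single record; Firehose buffers it and delivers it
    # to the configured destination (optionally transformed or compressed).
    response = firehose.put_record(
        DeliveryStreamName="clickstream-to-s3",  # hypothetical stream name
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
    print(response["RecordId"])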
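
Next, the catalog step: a minimal sketch that lists table metadata from the AWS Glue Data Catalog. The database name is a hypothetical example; in practice a Glue crawler or an ETL job usually populates these tables.

    import boto3

    glue = boto3.client("glue")

    # GetTables is paginated; walk every page of the (hypothetical) database.
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName="sales_db"):
        for table in page["TableList"]:
            location = table.get("StorageDescriptor", {}).get("Location", "n/a")
            print(f'{table["Name"]}: {location}')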
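
Finally, delivery: a minimal Amazon Athena sketch that runs SQL directly against cataloged data on S3, polls until the asynchronous query finishes, and prints the result rows. The database, table, and output bucket names are hypothetical examples.

    import time

    import boto3

    athena = boto3.client("athena")

    start = athena.start_query_execution(
        QueryString="SELECT event, COUNT(*) AS n FROM clickstream GROUP BY event",
        QueryExecutionContext={"Database": "sales_db"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = start["QueryExecutionId"]

    # Athena runs asynchronously; poll until the query reaches a final state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows:
            print([col.get("VarCharValue") for col in row["Data"]])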

Orchestration and Automation Options

Automation is suitable for simple repetitive tasks. Orchestration is needed for complex workflows involving the coordination of multiple services, teams, and dependencies across stages.

Services used for orchestration and automation:

  • AWS Lambda
  • AWS Step Functions (see the sketch after this list)
  • Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
  • Amazon EventBridge
  • Amazon SNS
  • Amazon SQS
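
As a sketch of orchestration, the snippet below starts an execution of an existing AWS Step Functions state machine, for example one that chains a Glue job, a data quality check, and an SNS notification. The state machine ARN and input payload are hypothetical placeholders.

    import json

    import boto3

    sfn = boto3.client("stepfunctions")

    # Kick off one run of the (hypothetical) ETL state machine with a JSON input.
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:etl-pipeline",
        input=json.dumps({"run_date": "2024-01-01"}),
    )
    print(response["executionArn"])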

Data Engineering Security and Monitoring

Security should address the following five areas:

  • Access management
    • AWS IAM, AWS ACM
  • Regulatory compliance
    • AWS Audit Manager, AWS Config
  • Sensitive data protection
    • Amazon Macie, AWS KMS, AWS Glue (see the encryption sketch after this list)
  • Data and network security
    • AWS Control Tower, Amazon GuardDuty, AWS WAF, AWS Shield
  • Data auditability
    • AWS CloudTrail, AWS Lake Formation, AWS Glue Data Catalog
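
As a sketch of sensitive data protection at rest, the snippet below writes an object to S3 with server-side encryption under a customer-managed AWS KMS key. The bucket name, object key, and key alias are hypothetical examples.

    import boto3

    s3 = boto3.client("s3")

    # SSE-KMS: S3 encrypts the object with the customer-managed key on write.
    s3.put_object(
        Bucket="my-curated-data",               # hypothetical bucket
        Key="pii/customers.csv",
        Body=b"id,email\n1,user@example.com\n",
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/data-lake-key",      # hypothetical key alias
    )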

The following are a few key aspects that should be monitored (a CloudWatch monitoring sketch follows the list):

  • Resources
  • Analytics Jobs
  • Data Pipelines
  • Data Access
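
As a monitoring sketch, the snippet below publishes a custom Amazon CloudWatch metric (rows processed by a pipeline run) that alarms and dashboards can then track. The namespace, metric, and dimension names are hypothetical examples.

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Emit one data point; CloudWatch aggregates these per namespace/dimensions.
    cloudwatch.put_metric_data(
        Namespace="DataPipelines",  # hypothetical custom namespace
        MetricData=[
            {
                "MetricName": "RowsProcessed",
                "Dimensions": [{"Name": "Pipeline", "Value": "daily-sales-etl"}],
                "Value": 12345,
                "Unit": "Count",
            }
        ],
    )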