AWS Certified Data Analytics Certificate Theory Summary
1. Base
Note: fill in any missing content based on the outline below
2. Collection
RealTime : Real-time data collection
- Kinesis Data Streams (KDS)
- Simple Queue Service (SQS)
- Internet of Things (IoT)
Near-real Time : Near-real-time data collection
- Kinesis Data Firehose (KDF)
- Database Migration Service (DMS)
Batch : Batch data collection
- Snowball
- Data Pipeline
2.1. Kinesis Data Streams
- Composed of multiple shards
- Retention : 1 ~ 365 Days
- Stored data cannot be deleted; records are immutable and expire only when the retention period ends
- Producer
- Record
- Data sent by producer
- Partition Key : Determines which shard the record is delivered to
- Data Blob : The record's data payload
- Consumer
- Capacity Mode
- Provisioned Mode
- User specifies the number of shards
- Cost per shard
- On-demand Mode
- Automatically scales based on traffic volume
- Default capacity : 4 MB/sec, 4000 msg/sec
- Cost based on stream hours and data volume in/out
- Security
- IAM-based authentication/authorization
- HTTPS used for data encryption in transit
- KMS used for data encryption at rest
- Accessible via VPC Endpoint within VPC
- Tracking through CloudTrail
2.1.1. Kinesis Producer
- Ex) Application, SDK, KPL, Kinesis Agent, CloudWatch Logs, AWS IoT, Kinesis Data Analytics
- Performance
- 1 MB/sec, 1000 msg/sec limit per shard
- ProvisionedThroughputExceededException occurs when exceeded
- Need to check if sending more data or if hot shard is occurring
- Resolve by retrying with backoff, increasing shards, checking partition key
- API
- Single : PutRecord
- Multiple : PutRecords
- Kinesis Producer Library (KPL)
- Supports C++/Java
- Retry logic support
- Holds data for up to 30 minutes and retries
- Supports Sync and Async API
- Sends metrics to CloudWatch
- Performs batching
- Increases throughput, reduces cost
- Waits for RecordMaxBufferedTime duration then sends at once (Default 100ms)
- Adds latency compared to calling the write API directly, so KPL is not recommended for latency-sensitive applications
- Does not provide compression, needs to be implemented in app
- Records aggregated with KPL must be de-aggregated through KCL or a helper library
- Kinesis Agent
- Sends log files to Kinesis Data Streams
- Based on KPL
- Supports data preprocessing
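The producer points above can be sketched with boto3 (assumed available at call time). PutRecords accepts at most 500 records per call, so records are chunked first; the stream name and retry strategy here are illustrative, not a definitive implementation:

```python
def chunk(records, size=500):
    """Split records into batches of at most `size` (the PutRecords per-call limit)."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def send_records(stream_name, records):
    """Send (partition_key, data_bytes) tuples via PutRecords.
    The partition key determines which shard each record lands on."""
    import boto3  # lazy import so the pure helper above works without boto3
    client = boto3.client("kinesis")
    for batch in chunk(records):
        entries = [{"PartitionKey": pk, "Data": data} for pk, data in batch]
        resp = client.put_records(StreamName=stream_name, Records=entries)
        # FailedRecordCount > 0 usually means a per-shard limit was hit
        # (ProvisionedThroughputExceededException); retry failed entries.
        if resp["FailedRecordCount"]:
            failed = [e for e, r in zip(batch, resp["Records"]) if "ErrorCode" in r]
            send_records(stream_name, failed)  # naive retry; add backoff in practice
```

A production producer would add exponential backoff between retries, as the notes above recommend.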
2.1.2. Kinesis Consumer
- Ex) Application, AWS Lambda, Kinesis Data Firehose, Kinesis Data Analytics
- Performance
- Default : 2 MB/sec per shard, shared across all consumers
- With Enhanced Fan-Out : 2 MB/sec per shard per consumer
- API
- GetRecords
- Retrieves multiple records
- Requires client polling
- Can receive up to 10 MB of data or 10,000 records per call
- After a 10 MB response, subsequent calls within the next 5 seconds are throttled because of the 2 MB/sec per-shard read limit
- Can call up to 5 times per second per shard
- Kinesis Client Library (KCL)
- Java natively; Golang, Python, Ruby, Node.js via MultiLangDaemon
- Multiple consumers can distribute and process multiple shards through group functionality
- Can resume data processing through checkpointing functionality
- Performs de-aggregation of data aggregated through KPL
- Kinesis Connector Library
- Delivers data to other AWS services
- Needs to run on EC2 instance
- Deprecated : Replaced by Kinesis Firehose
- Lambda
- Records are de-aggregated and delivered to the Lambda function
- Can specify batch size
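A minimal polling-consumer sketch for the GetRecords flow above. The client is passed in so it can be a boto3 Kinesis client in practice; stream and shard names are illustrative:

```python
import time

def drain_shard(client, stream_name, shard_id, poll_interval=0.2):
    """Poll one shard from TRIM_HORIZON, yielding record data blobs.
    `client` is expected to expose the boto3 kinesis client interface."""
    it = client.get_shard_iterator(
        StreamName=stream_name, ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    while it:
        resp = client.get_records(ShardIterator=it, Limit=1000)
        for rec in resp["Records"]:
            yield rec["Data"]
        it = resp.get("NextShardIterator")
        if not resp["Records"]:
            break  # caught up; a real consumer would keep polling
        time.sleep(poll_interval)  # stay under 5 GetRecords calls/sec/shard
```

KCL does this (plus checkpointing and shard distribution) for you; this sketch only shows the raw API loop.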
2.1.3. Kinesis Scaling
- Adding Shards
- Splitting a shard creates two child shards that replace the existing parent shard
- Recommended to process data in child shards after processing all data in parent shard
- Implemented internally in KCL
- Merging Shards
- When merging shards, one child shard is created replacing multiple existing parent shards
2.1.4. Duplicated Records
- Producer
- Due to temporary network failures, the same record may be duplicated and stored in stream
- Cannot avoid duplicate storage, needs duplicate processing at consumer
- Consumer
- Add a unique ID to each record and make the application logic idempotent so the same data is never processed twice
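The idempotent-consumer idea above can be sketched as a wrapper that tracks IDs already processed. In production the seen-ID set would live in durable external storage (e.g. DynamoDB), not process memory:

```python
def make_idempotent_handler(process):
    """Wrap `process` so each unique record ID is handled at most once,
    even if Kinesis delivers the same record twice."""
    seen = set()  # in production: a durable store, not in-memory state
    def handle(record):
        rid = record["id"]  # unique ID embedded by the producer
        if rid in seen:
            return False    # duplicate: skip
        process(record)
        seen.add(rid)
        return True
    return handle
```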
2.2. Kinesis Data Firehose
- Loads data to AWS services and 3rd party applications
- Fully Managed Service : Supports auto-scaling
- Near Real Time : Minimum 60 second delay occurs
- Compression support : GZIP, ZIP, SNAPPY
- Producer
- SDK, KPL, Kinesis Agent, Kinesis Data Streams, Amazon CloudWatch, AWS IoT
- Maximum 1MB per record
- Consumer
- Amazon S3, Amazon OpenSearch, Datadog, Splunk, New Relic, MongoDB, HTTP Endpoints
- Performs batch writes
- Transformation
- Can transform data using Lambda
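A Firehose transformation Lambda receives batches of base64-encoded records and must return each one with its original recordId, a result status, and re-encoded data. A minimal sketch (the newline-append transform is just an example choice):

```python
import base64

def lambda_handler(event, context):
    """Firehose data-transformation Lambda: decode each record,
    append a trailing newline (so S3 objects are line-delimited), re-encode."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.rstrip("\n") + "\n"
        output.append({
            "recordId": record["recordId"],  # must echo the original ID
            "result": "Ok",                  # Ok | Dropped | ProcessingFailed
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```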
3. Processing
3.1. Glue
- Performs serverless ETL
- Can process data from S3, RDS, Redshift
- Glue Crawler
- Generates schema from S3 data
- Generated schema is stored in Glue Data Catalog
- Glue Data Catalog
- Schema information storage
- Can be created through user input or Glue Crawler
- Can convert EMR Hive metastore to Glue Data Catalog
- Can serve as the Hive metastore for EMR Hive
- Glue Studio : Processes Glue jobs through visual interface
- Glue Data Quality
- Service for data quality evaluation and inspection
- Defines rules using DQDL (Data Quality Definition Language)
- Available in Glue Studio
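The schema a crawler writes to the Data Catalog can be read back with the Glue GetTable API. A small sketch, assuming boto3 is available; the helper that parses the response shape is pure, and database/table names are illustrative:

```python
def table_columns(get_table_response):
    """Extract (column name, type) pairs from a Glue GetTable response."""
    cols = get_table_response["Table"]["StorageDescriptor"]["Columns"]
    return [(c["Name"], c["Type"]) for c in cols]

def fetch_columns(database, table):
    """Look up a table's schema from the Glue Data Catalog."""
    import boto3  # lazy import; requires AWS credentials at call time
    glue = boto3.client("glue")
    return table_columns(glue.get_table(DatabaseName=database, Name=table))
```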
3.2. Glue DataBrew
- Visual data preparation tool with prebuilt transformations (no code required)
3.3. Lake Formation
- Manages data access permissions for Data Lake (S3)
- Provides data monitoring functionality
- Provides data transformation functionality
- Based on Glue
3.4. Amazon Security Lake
- Centralizes data security
- Normalizes data including multi-account and on-premise environments
- Manages data lifecycle
3.5. EMR
- Managed Hadoop framework based on EC2
- Spark, HBase, Presto, Flink
- EMR Notebooks
3.5.1. EMR Cluster
- Master Node
- Tracks task status, manages cluster status
- Core Node
- Provides HDFS, performs tasks
- Must have at least one core node in multi-node cluster
- Because HDFS is still used in some Hadoop ecosystem components
- Can scale up & down but data loss risk exists
- Task Node
- Performs tasks
- No data loss risk when removed since it does not store data like HDFS
- Suitable for Spot instances
- Transient vs Long-Running Cluster
- Transient Cluster
- Removes cluster after task completion
- Suitable for batch jobs
- Long-running Cluster
- Performs stream tasks using RI instances
- Performs batch tasks using Spot instances
- Task Submission
- Submit tasks by directly accessing master node
- Submit tasks through AWS Console
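Besides the console and direct master-node access, steps can be submitted with the EMR AddJobFlowSteps API. A sketch assuming boto3; the script path and cluster ID are placeholders:

```python
def spark_step(name, args):
    """Build an EMR step that runs spark-submit via command-runner.jar,
    suitable for passing to add_job_flow_steps."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit"] + list(args),
        },
    }

def submit_step(cluster_id, step):
    import boto3  # lazy import; needs AWS credentials at call time
    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
```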
3.5.2. EMR with AWS Services
- Stores input data and output data in S3
- Performs performance monitoring through CloudWatch
- Manages authorization through IAM
- Performs audit through CloudTrail
- Configures task scheduling and workflow through AWS Data Pipeline and AWS Step Functions
3.5.3. EMR Storage
- HDFS
- Composed of core node cluster
- Same lifecycle as EMR cluster, HDFS data is also deleted when EMR cluster is removed
- Recommended for caching temporary data due to faster performance than EMRFS
- EMRFS
- S3-based filesystem
- Separate lifecycle from EMR cluster, data preserved even when EMR cluster is removed
- Guarantees strong consistency in S3
- Local Filesystem
- Used for temporary cache
3.5.4. EMR Managed Scaling
- EMR Automatic Scaling
- Based on CloudWatch metrics
- Only supports instance groups
- EMR Managed Scaling
- Supports instance groups and instance fleets
- Supports Spark, Hive, YARN workloads
3.5.5. EMR Serverless
- Supports Spark, Hive, Presto
- Automatically manages cluster size
- Users can specify size
- Processes tasks only within one region
3.5.6. EMR Spark
- Can integrate with Kinesis and Spark Streaming
- Can store data processed through Spark in Redshift
- Can analyze data by executing Spark in Athena Console
3.5.7. EMR Hive
- Can query data stored in HDFS and EMRFS using SQL (HiveQL)
- Performs distributed SQL processing based on MapReduce and Tez
- Suitable for OLAP
- Supports User Defined Functions, Thrift, JDBC/ODBC drivers
- Metastore
- Data structure information
- Column name and type information
- Stores metastore in MySQL on master node by default
- Recommends storing in Glue Data Catalog or Amazon RDS in AWS environment
- Integration with AWS Services
- Can load data from S3 and write data to S3
- Can load scripts from S3
3.5.8. EMR Apache Pig
- Provides script environment to quickly help write Mapper and Reducer
- Performs Map and Reduce steps using SQL-like syntax
- Provides user-defined functions (UDFs)
- Performs distributed processing based on MapReduce and Tez
- Integration with AWS Services
- Can query S3 data
- Can load JAR and scripts from S3
3.5.9. EMR HBase
- Non-relational, petabyte-scale database
- Runs on HDFS; modeled after Google BigTable
- Performs computation in memory
- Integration with AWS Services
- Supports S3 data storage through EMRFS
- Can backup to S3
3.5.10. EMR Presto
- Can connect to various big data databases
- HDFS, S3, Cassandra, MongoDB, HBase, SQL, Redshift, Teradata
- Suitable for OLAP tasks
3.5.11. EMR Hue, Splunk, Flume
- Hue
- Provides web interface for EMR cluster management
- Provides integration with IAM
- Provides data movement functionality between HDFS and S3
- Splunk : Log search and visualization platform that can integrate with EMR
- Flume
- Log aggregation platform
- Streams log data into sinks such as HDFS and HBase
- MXNet
- Neural network construction platform
- Included in EMR
3.5.12. EMR Security
- Can use various authentication/authorization methods
- IAM Policy, IAM Role, Kerberos, SSH
- Block Public Access
- Blocks public access
3.6. AWS Data Pipeline
- Can set S3, RDS, DynamoDB, Redshift, EMR as data transfer destinations
- Manages task dependencies
- Retries tasks and notifies on failure
- Supports cross-region pipelines
- Supports precondition checks
- Supports on-premise data sources
- Provides high availability
- Provides various activities
- EMR, Hive, Copy, SQL, Script
3.7. AWS Step Functions
- Workflow service
- Provides easy visualization functionality
- Provides error handling and retry functionality
- Provides audit functionality
- Provides wait functionality for arbitrary duration
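The wait and retry features above map to states in an Amazon States Language definition. A sketch that builds one as a dict; the Lambda ARN is a placeholder:

```python
import json

def build_definition(lambda_arn, wait_seconds=300):
    """ASL state machine with a Wait state (arbitrary-duration wait)
    and a Retry policy (error handling) on the task."""
    return json.dumps({
        "StartAt": "WaitBeforeRun",
        "States": {
            "WaitBeforeRun": {"Type": "Wait", "Seconds": wait_seconds,
                              "Next": "RunTask"},
            "RunTask": {
                "Type": "Task",
                "Resource": lambda_arn,
                "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                           "IntervalSeconds": 10, "MaxAttempts": 3,
                           "BackoffRate": 2.0}],
                "End": True,
            },
        },
    })
```

The JSON string would be passed as the `definition` argument of the Step Functions CreateStateMachine API.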
4. Analytics
4.1. Kinesis Analytics
- Real-time data processing service
- Components
- Input Stream : Stream where data enters
- Reference Table : Table referenced during data processing, can join with S3 data
- Output Stream : Stream that outputs processed data
- Error Stream : Stream that outputs errors that occur during data processing
- with Lambda
- Can specify Lambda as data destination
- Modifies data and delivers to AWS services
- with Apache Flink
- Supports Apache Flink framework
- Write Flink applications instead of SQL queries
- Serverless
- Performs auto-scaling
- Measures cost in KPU units
- 1 KPU = 1 vCPU, 4 GB memory
- Components
- Flink Source : MSK, Kinesis Data Streams
- Flink Datastream API
- Flink Sink : S3, Kinesis Datastream, Kinesis Data Firehose
- RANDOM_CUT_FOREST
- SQL function that performs anomaly detection
4.2. OpenSearch
- Search engine based on Elasticsearch
- Scalable
- Based on Lucene
- Use cases
- Full-text search
- Log analytics
- Application monitoring
- Security analytics
- Clickstream analytics
- Concept
- Document : Search target, supports not only full-text but also JSON structure
- Types : Schema definition, not commonly used currently
- Indices
- Composed of inverted index
- Composed of multiple shards, performs distributed processing
- Primary Shard : Performs read/write
- Replica Shard : Can only perform reads, performs load balancing when multiple replicas are configured
- Fully-managed (Not Serverless)
- Performs scale in/out without downtime
- Integration with various AWS services
- S3 Bucket
- Kinesis Data Streams
- DynamoDB Streams
- CloudWatch, CloudTrail
- Zone Awareness
- Options
- Dedicated Master Node : Number and spec of nodes
- Domains : Means all information needed to run cluster (configuration information)
- Provides S3-based snapshot functionality
- Security
- Resource-based policy
- Identity-based policy
- IP-based policy
- Request signing
- Private cluster in VPC
- Dashboard Security
- AWS Cognito
- Outside VPC
- SSH Tunnel
- Reverse proxy on EC2 instance
- VPC Direct Connect
- VPN
- Not suitable for OLTP tasks
- Does not provide transaction functionality
4.2.1. Index Management
- Storage Type
- Hot Storage
- Fastest store, but high cost
- EBS, Instance Store
- Default store
- Ultra Warm Storage
- Based on S3 + Caching
- Requires dedicated master node
- Advantageous for cases with low writes like logs and immutable data
- Cold Storage
- Based on S3
- Advantageous for storing old or periodic data
- Requires dedicated master node, requires UltraWarm activation
- Not available on T2, T3 instance types
- Data migration possible between storage types
- Index State Management (ISM)
- Deletes old indices
- Converts index to read-only
- Moves index from Hot -> UltraWarm -> Cold Storage
- Reduces replica count
- Automatic index snapshot
- Summarizes index through rollup
- Cost reduction
- Executes at 30 ~ 48 minute intervals
- Sends notification when execution completes
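The ISM lifecycle steps above (age out of hot, move to UltraWarm, delete) are declared as a policy document. A sketch of the policy as a Python dict; field names follow the OpenSearch ISM policy schema, and the age thresholds are illustrative assumptions:

```python
def lifecycle_policy(warm_after="7d", delete_after="90d"):
    """ISM policy sketch: hot -> UltraWarm after `warm_after`,
    delete after `delete_after`."""
    return {
        "policy": {
            "description": "hot -> warm -> delete",
            "default_state": "hot",
            "states": [
                {"name": "hot", "actions": [],
                 "transitions": [{"state_name": "warm",
                                  "conditions": {"min_index_age": warm_after}}]},
                {"name": "warm", "actions": [{"warm_migration": {}}],
                 "transitions": [{"state_name": "delete",
                                  "conditions": {"min_index_age": delete_after}}]},
                {"name": "delete", "actions": [{"delete": {}}],
                 "transitions": []},
            ],
        }
    }
```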
- Provides cross-cluster replication functionality
- Ensures high availability
- Follower index synchronizes by fetching data from leader index
- Stability
- Recommends using 3 dedicated master nodes
- Disk capacity management
- Selecting appropriate shard count
- ??
4.2.2. Performance
- JVM memory pressure can occur when shards are unevenly distributed or there are too many shards
- Delete old and unused indices to relieve JVM memory pressure
4.3. Athena
- SQL query service for S3
- Based on Presto
- Serverless
- Supports various formats
- CSV, TSV, JSON, ORC (Columnar), Parquet (Columnar), Avro, Snappy, Zlib, LZO, Gzip
- Athena collects metadata through Glue Data Catalog
- Various use cases
- Ad-hoc queries for app log analysis
- Queries for data analysis before loading data into Redshift
- CloudTrail, CloudFront, VPC, ELB log analysis
- Integration with QuickSight
- Provides workgroups functionality
- Can track query access permissions and costs per workgroup
- Can integrate with IAM, CloudWatch, SNS
- Can configure the following per workgroup
- Query history, data limit, IAM policy, encryption settings
- Cost
- $5 per TB of data scanned
- Charges per successful or cancelled query
- Failed queries are not charged
- DDL is not charged
- Cost reduction and performance benefits when using columnar format
- Glue and S3 are charged separately
- Performance
- Performance benefits when using columnar format
- Should be composed of fewer large files for performance benefits
- Utilize partition functionality
- Transaction
- Available through Iceberg
- Specify ICEBERG in table type
- Transaction functionality also available through Lake Formation’s governed tables
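Setting the ICEBERG table type happens in the CREATE TABLE DDL via TBLPROPERTIES. A small builder sketch; table name, location, and columns are placeholders:

```python
def iceberg_ddl(table, location, columns):
    """Athena CREATE TABLE statement for an Iceberg table;
    the table type is set through TBLPROPERTIES."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return (f"CREATE TABLE {table} ({cols}) "
            f"LOCATION '{location}' "
            f"TBLPROPERTIES ('table_type' = 'ICEBERG')")
```

The resulting string would be run through the Athena StartQueryExecution API or the console.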
4.4. Redshift
- Service for OLAP
- Provides SQL, ODBC, JDBC interfaces
- Scale-up/down on-demand method
- Built-in replication
- Monitoring based on CloudWatch and CloudTrail
- Architecture
- Leader Node
- Receives client queries and creates parallel processing plan
- Distributes tasks to compute nodes and collects results according to parallel processing plan
- Compute Node
- Can configure up to 128 compute nodes
- Type
- Dense Storage : Type with HDD and low-cost large capacity storage
- Dense Compute : Type focused on compute performance
- Spectrum
- Directly accesses data in S3
- Concurrency limit
- Supports horizontal scaling
- Separates storage and compute
- Performance
- MPP (Massively Parallel Processing)
- Columnar data storage
- Column compression
- Durability
- Data replication occurs within cluster
- Performs data backup to S3
- Automatically recovers failed nodes
- Only RA3 clusters support Multi-AZ (DS2 is Single AZ)
- Scaling
- Performs vertical and horizontal scaling
- New cluster is created and data is migrated when scaling (temporary downtime occurs)
- Data Distribution Style
- Determines how to distribute data to compute nodes
- Auto : Automatically distributes data based on data size
- Even : Automatically distributes data in round-robin fashion
- Key : Distributes data based on key and hashing
- All : Replicates data to all compute nodes
- Sort Key
- Stored sorted on disk according to sort key
- Compound : Combines multiple columns to use as sort key
- Interleaved : Gives equal weight to each column in the sort key; suited to queries that filter on varying columns
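The distribution style and sort key are both set in table DDL. A builder sketch using a KEY distribution style with a COMPOUND sort key; table and column names are placeholders:

```python
def create_table_ddl(table, columns, dist_key, sort_keys):
    """Redshift CREATE TABLE with KEY distribution and a COMPOUND sort key,
    matching the distribution/sort-key options above."""
    cols = ", ".join(f"{n} {t}" for n, t in columns)
    sks = ", ".join(sort_keys)
    return (f"CREATE TABLE {table} ({cols}) "
            f"DISTSTYLE KEY DISTKEY ({dist_key}) "
            f"COMPOUND SORTKEY ({sks})")
```

Choosing the join/filter column as DISTKEY keeps co-joined rows on the same compute node and avoids data shuffling.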
- Data Replication
- COPY
- Loads data from S3, EMR, DynamoDB, or remote hosts (via SSH)
- Performs data replication in parallel
- UNLOAD : Performs replication of processed results to S3
- S3 Auto-copy : Automatically replicates to Redshift when data changes in S3
- Aurora zero-ETL Integration : Automatically replicates data from Aurora to Redshift
- Redshift Ingestion
- DBLINK : Performs data replication by connecting to RDS
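A COPY load can be issued without managing JDBC connections by using the Redshift Data API. A sketch; the bucket, role ARN, cluster ID, and database user are placeholders:

```python
def copy_sql(table, s3_prefix, iam_role, fmt="FORMAT AS PARQUET"):
    """Build a COPY statement: Redshift loads all objects under the
    S3 prefix in parallel, authenticating with the given IAM role."""
    return (f"COPY {table} FROM '{s3_prefix}' "
            f"IAM_ROLE '{iam_role}' {fmt}")

def run_copy(cluster_id, database, db_user, sql):
    import boto3  # lazy import; needs AWS credentials at call time
    rsd = boto3.client("redshift-data")
    return rsd.execute_statement(ClusterIdentifier=cluster_id,
                                 Database=database, DbUser=db_user, Sql=sql)
```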
- Integration with AWS Services
- S3, DMS, EMR, EC2, Data Pipeline
- WLM (Workload Management)
- Query queue
- Manages by assigning priority to queries
- Default configuration is one queue with a concurrency level of 5
- Can configure up to 8 queues
- Each queue has concurrency level, can be set up to 50
- Automatically adjusts concurrency level based on query characteristics, can also be set manually
- Concurrency Scaling
- Processes queries by adding clusters
- Sends and processes queries queued in WLM to added clusters
- SQA (Short Query Acceleration)
- Uses queue for short queries in WLM
- Applies to read-only queries and CREATE TABLE AS queries
- Can set short criteria time
- VACUUM
- VACUUM FULL : Re-sorts rows and reclaims space from deleted rows (default)
- VACUUM DELETE ONLY : Reclaims space from deleted rows without re-sorting
- VACUUM SORT ONLY : Re-sorts rows without reclaiming space
- VACUUM REINDEX : Re-analyzes interleaved sort key distribution, then runs a full vacuum
- Resize
- Elastic Resize
- Can quickly add/remove nodes or change node type (DS2 to RA3)
- Limited number of nodes that can be changed (2, 4, 6, 8…)
- Cluster down occurs for a few minutes, connections remain open
- Classic Resize
- Existing resize method
- Can add/remove nodes, change node type
- Can freely set number of nodes
- Cluster operates in read-only mode for several hours because it creates new cluster and replicates data from existing cluster
- Snapshot
- Performs snapshot and creates cluster of new size
- Security
- Uses HSM (Hardware Security Module)
- Grants privilege permissions to users and groups
- Redshift Serverless
- No need to manage EC2 instances
- Optimizes costs & performance
- Easy to set up development & test environments
- Easy to configure ad-hoc environments
- Can utilize snapshot functionality
5. Visualization
5.1. QuickSight
- Data visualization
- Provides pagination functionality
- Provides alert functionality
- Serverless
- Data Source
- Redshift, Aurora/RDS, Athena, OpenSearch, IoT Analytics, Files (Excel, CSV, TSV)
- SPICE (Super-fast, Parallel, In-memory Calculation Engine) : In-memory engine used in QuickSight
- Specialized for ad-hoc queries
- Security
- Multi-factor auth
- VPC connectivity
- Row-level security
- Private VPC access
- QuickSight can only access data sources in the same region by default
- Pricing
- Annual subscription, monthly subscription
- Additional charges when using extra SPICE capacity
- Dashboard
- Read-only
- Can share externally
- Can embed
- Authentication through Active Directory, Cognito, SSO
- Provides JavaScript SDK and QuickSight API
- ML Functionality
- Anomaly detection functionality
- Prediction functionality
- Insight recommendation functionality
- QuickSight Q
- Performs queries based on natural language