Platform Engineering — Quick Notes

In my opinion, a Platform Engineer has three major responsibilities:

  1. Automation (CI/CD Workflow): basic application automation, infrastructure automation, data pipeline automation, AI deployment automation.
  2. Platform Operation Stability: monitoring and alerting, quality control (SLOs), rollback strategy, cost hygiene.
  3. Access Control, Security, and Compliance: least-privilege IAM practice, secrets management, encryption, network access control, audit support, vulnerability remediation.

This document reorganizes the notes into structured sections for clarity and readability.

Note:

I am already familiar with basic application automation and infrastructure automation.

I was less familiar with data pipeline automation but studied it over the weekends.
Correct me if I'm wrong!


1. Basic Application Automation — App CI/CD Workflow

flowchart LR
    %% === Developer & Git Commit ===
    A[🧑‍💻 Developer] -->|Pre-commit checks| A1[🔍 Lint / Local Test / Format]
    A -->|commit| GH[(GitHub)]
    %% === Source Stage ===
    GH --> S[📁 Source]
    S --> S1[Branch Protection]
    S1 --> S2[Linting]
    %% === Build Stage ===
    S --> B[⚙️ Build]
    B --> B1[Building Image / Compiling Code]
    B1 --> B2[🐳 Container Image]
    B2 --> B3[🧪 Unit Tests]
    B3 --> B4[📊 Code Coverage 80–90%]
    %% === Test Stage ===
    B --> T[🧫 Test]
    T --> T1[Integration Test]
    %% === Release Stage ===
    T --> R[🚀 Release]
    R --> R1[Ship Image to Registry]
    R1 --> REG[(📦 Registry)]
    %% === Style Definition ===
    classDef main fill:#004B8D,stroke:#333,stroke-width:1px,color:#fff,font-size:12px;
    class S,B,T,R main;
    linkStyle default stroke:#999,stroke-width:1.2px;

2. Infrastructure Automation

2.1 Automation Approaches

There are two ways to implement infrastructure automation: a visual drag-and-drop orchestration approach (e.g., AWS Step Functions, Azure Logic Apps) and an Infrastructure as Code (IaC) approach using Terraform.

In general, during the exploratory development phase, you can prioritize the drag-and-drop approach to quickly assemble, validate logic, and adjust dependencies; when workflows and architecture have stabilized, use Terraform for code-based management.

Before doing infrastructure automation, you need a basic understanding of commonly used AWS services and their attributes. For example:

  • When creating an EC2 instance, pay attention to attributes such as subnet, instance type (CPU/memory), security group, and key pair.
  • When creating an S3 bucket, focus on access control (ACL/Policy), encryption, versioning, and lifecycle rules.

In addition, whether it is EC2, S3, or other resources, when creating them you must configure permissions (for access control) and tagging (for cost hygiene). This will be discussed in detail in subsequent chapters.
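As a small illustration of the tagging point, a helper can centralize the required tag keys used throughout these notes (Project, Environment, Owner, CostCenter); the helper names below are hypothetical:

```python
# Illustrative sketch: one place that defines the tag set every resource
# we create must carry. Adjust the keys to your organization's convention.
REQUIRED_TAG_KEYS = ("Project", "Environment", "Owner", "CostCenter")

def standard_tags(project, environment, owner, cost_center):
    """Return tags in the list-of-dicts shape boto3's EC2/RDS APIs expect."""
    values = {
        "Project": project,
        "Environment": environment,
        "Owner": owner,
        "CostCenter": cost_center,
    }
    return [{"Key": k, "Value": values[k]} for k in REQUIRED_TAG_KEYS]

def missing_tags(tag_list):
    """Which required keys are absent from an existing resource's tags."""
    present = {t["Key"] for t in tag_list}
    return [k for k in REQUIRED_TAG_KEYS if k not in present]
```

The same list can then be passed, for example, as the Tags in a boto3 TagSpecifications argument, and missing_tags reused by the tag-scanning script in the cost-hygiene section.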

2.2 AWS Service Attribute Quick Reference

| Layer | Service | Main Role | Description / Dependency Logic | Frequently Adjusted Attributes |
|---|---|---|---|---|
| 1. Network Foundation Layer | VPC (Virtual Private Cloud) | Define network boundaries (CIDR, DNS, subnets) | Foundation for all resources; must be created first | cidr_block, enable_dns_hostnames, enable_dns_support, tags |
| | Subnet | Network segmentation (public/private) | Depends on VPC; determines whether resources can access the public Internet | vpc_id, cidr_block, availability_zone, map_public_ip_on_launch, tags |
| | Security Group | Network access control | Depends on VPC; controls EC2, RDS, ECS traffic | ingress, egress, description, tags |
| | Route 53 | DNS resolution | Can connect to CloudFront, ALB, or API Gateway | records, ttl, alias, health_check_id |
| 2. Storage & Encryption Layer | S3 (Simple Storage Service) | Object storage | Can be used for static websites, logs, backup sources | acl, versioning, lifecycle_rule, public_access_block, server_side_encryption_configuration, tags |
| | EBS (Elastic Block Store) | EC2 block storage | Attached to EC2 instances | size, type, iops, encrypted, kms_key_id, tags |
| | EFS (Elastic File System) | Shared file system | Mountable to multiple EC2 / ECS | performance_mode, throughput_mode, lifecycle_policy, encrypted, mount_targets |
| | KMS (Key Management Service) | Key management & encryption | Used by S3 / EBS / RDS / CloudTrail | enable_key_rotation, policy, deletion_window_in_days, tags |
| 3. Identity & Access Layer | IAM Role / Policy | Permissions & access control | Required by all services (EC2, Lambda, Terraform) | assume_role_policy, inline_policies, managed_policy_arns, tags |
| 4. Compute & Container Layer | EC2 (Elastic Compute Cloud) | Virtual machine instances | Depends on VPC, Subnet, SG, IAM, EBS | ami, instance_type, key_name, subnet_id, vpc_security_group_ids, user_data, tags |
| | ECS / EKS | Container orchestration | Depends on EC2 / Fargate, VPC, SG, IAM | cluster_name, task_definition, service, network_configuration, execution_role_arn |
| | Lambda | Serverless functions | Depends on IAM Role, VPC (optional) | runtime, handler, timeout, memory_size, environment, vpc_config, source_code_hash, tags |
| 5. Data & Messaging Layer | RDS (Relational Database Service) | Managed database | Depends on VPC, Subnet, SG, KMS (encryption) | engine_version, instance_class, storage_type, multi_az, backup_retention_period, vpc_security_group_ids, tags |
| | DynamoDB (NoSQL Database) | Key-value / document database | Serverless NoSQL; often used by Lambda, Step Functions, Terraform backend locking | hash_key, range_key, billing_mode, read_capacity, write_capacity, stream_enabled, ttl, tags |
| | SQS (Simple Queue Service) | Queue-based communication | Asynchronous decoupling for Lambda / ECS | delay_seconds, message_retention_seconds, visibility_timeout_seconds, fifo_queue, tags |
| | SNS (Simple Notification Service) | Message notifications | Works with CloudWatch / Lambda / SQS | topic_name, subscription, delivery_policy, tags |
| 6. Network & Distribution Layer | Load Balancer (ALB/NLB) | Traffic distribution | Depends on Subnet, SG; connects to EC2 / ECS | listener, target_group, health_check, subnets, security_groups, tags |
| | CloudFront (CDN) | Global content distribution | Depends on S3 / ALB / ACM certificates | origin, default_cache_behavior, viewer_certificate, aliases, enabled, tags |
| 7. Monitoring & Logging Layer | CloudWatch | Monitoring & alerting | Depends on SNS; collects logs from Lambda / ECS / EC2 | metric_alarm, threshold, retention_in_days, comparison_operator, sns_topic_arn |
| | CloudTrail | Audit logs | Depends on S3, KMS; tracks API operations | is_multi_region_trail, s3_bucket_name, enable_log_file_validation, cloud_watch_logs_group_arn, tags |
| 8. IaC & Automation Layer | CloudFormation / Terraform Backend | Infrastructure as Code (IaC) | Depends on S3 (state storage) + DynamoDB (locking) | bucket, dynamodb_table, region, encrypt, kms_key_id |
| 9. Image & Artifact Layer | ECR (Elastic Container Registry) | Image registry | Used by ECS / EKS to pull images | image_scanning_configuration, encryption_configuration, lifecycle_policy, repository_name, tags |

2.3 Terraform Basics

2.3.1 Overview

Terraform is an Infrastructure as Code (IaC) tool that acts as a translator between your configuration files (written in HashiCorp Configuration Language, HCL) and the APIs of various cloud providers (like AWS, Azure, GCP, OCI). Terraform configurations are made up of several key components:

| Concept | Description |
|---|---|
| Provider | A plugin provided by cloud vendors (e.g., AWS, OCI, Azure) that allows Terraform to communicate with their APIs — essentially the translator between Terraform and the cloud. |
| Resource | The actual cloud objects defined in Terraform, such as compute instances, storage buckets, or databases. |
| Output | Used to display or pass results after Terraform execution — for example, the public IP address of an instance or a database endpoint. |

2.3.2 Example: Resource Definition (OCI Instance)

resource "oci_core_instance" "arm_instance" {
  availability_domain = var.availability_domain
  compartment_id      = var.tenancy_ocid  # Free-tier users usually have only the root compartment, whose ID is equal to tenancy OCID
  shape               = var.instance_shape

  shape_config {
    ocpus         = var.ocpus
    memory_in_gbs = var.memory_in_gbs
  }

  create_vnic_details {
    subnet_id        = var.subnet_id
    assign_public_ip = true
  }

  source_details {
    source_type             = var.instance_source_type
    source_id               = var.source_id
    boot_volume_size_in_gbs = var.boot_volume_size_in_gbs
  }

  metadata = {
    ssh_authorized_keys = var.ssh_authorized_keys
  }
}

2.3.3 Terraform Execution Workflow

A typical Terraform workflow consists of the following steps:

terraform init 
-> terraform plan
-> terraform apply
   ├─ ① Retrieve current state (from local or backend)
   ├─ ② Lock the state (state locking)
   ├─ ③ Apply changes (create / update / destroy)
   ├─ ④ Write updated state file
   └─ ⑤ Unlock state

In collaborative or production environments, it is strongly recommended to store Terraform state in a remote backend with state locking enabled. A commonly recommended combination on AWS is:

  • S3 backend → stores the Terraform state file (state)
  • DynamoDB → provides locking to prevent concurrent modifications
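A minimal backend block wiring these two together might look like the following (bucket, table, and key names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"    # placeholder bucket name
    key            = "prod/terraform.tfstate"
    region         = "ap-southeast-2"
    encrypt        = true                    # server-side encryption of the state file
    dynamodb_table = "terraform-state-lock"  # placeholder lock table (hash key: LockID)
  }
}
```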

2.3.4 Terraform Modularization

root/
├── main.tf                # Root module (invokes submodules)
├── variables.tf
├── outputs.tf
└── modules/
    └── ec2_instance/      # Submodule
        ├── main.tf
        ├── variables.tf
        └── outputs.tf
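The root main.tf invokes the submodule roughly like this (module name, variables, and outputs are illustrative):

```hcl
# root main.tf — invoke the submodule (illustrative names)
module "ec2_instance" {
  source        = "./modules/ec2_instance"
  instance_type = "t3.micro"
  subnet_id     = var.subnet_id
}

# surface the submodule's output at the root level
output "instance_id" {
  value = module.ec2_instance.instance_id
}
```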

2.3.5 EKS Cluster and Infrastructure Automation

The creation of an EKS cluster is also part of infrastructure automation. In an EKS environment, Terraform is typically responsible for:

  • Setting up the network environment for EKS (VPC, subnets, security groups);
  • Creating Node Group instances for running Pods;
  • Managing IAM roles and policies;
  • Configuring monitoring components (e.g., CloudWatch).

After deployment, what you get is essentially an empty Kubernetes cluster — Terraform provisions the environment, but it does not control what runs inside the cluster.

Kubernetes, on the other hand, uses YAML manifests to define what runs within the cluster:

  • Which application (container image) to run
  • How many replicas to deploy
  • Which ports to expose and monitor
  • Which configuration or storage resources to attach

These application-level details are outside Terraform’s scope.
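For illustration, a minimal Deployment manifest covering those four concerns might look like this (names, image URI, port, and ConfigMap are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                  # placeholder application name
spec:
  replicas: 3                     # how many replicas to deploy
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: 123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/demo:1.0  # which image to run
          ports:
            - containerPort: 8080  # which port to expose and monitor
          envFrom:
            - configMapRef:
                name: demo-config  # which configuration to attach
```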

Note: Terraform can also indirectly deploy containerized apps by invoking the Kubernetes API (similar to kubectl apply), but for clarity and maintainability, avoid using Terraform for application deployment in the infrastructure initialization stage.


3. Data Pipeline Automation

Why data pipeline automation is needed: Data pipeline automation makes “data supply” as repeatable, observable, and rollback-able as application delivery. It uses engineering practices to ensure stable, fast, and low-cost delivery of reliable data to BI/AI applications.

  • What is Snowflake: originally focused on the data warehouse; supports SQL and Python; delivered as SaaS.
  • What is Databricks: originally focused on the data lake and ML; delivered as PaaS.
  • Snowflake and Databricks: both can be regarded as data warehouses in this context.
  • What is dbt: a framework that sits on top of the data warehouse. It turns warehouse data into usable data products via models, tests, and environments. Its run/test commands integrate well with CI/CD: orchestrators (e.g., Airflow) or global CI/CD pipelines call dbt run to transform models into cleaned and modeled tables/views, and dbt test to enforce data quality.

3.1 dbt Models

Each dbt model (.sql file) is essentially executable SQL + a Jinja templating layer. A model defines the full mapping from upstream sources → transformation logic → target tables/views.

Common model types:

| Type | Description | Examples |
|---|---|---|
| Staging Models | Light cleaning and standardization of raw data | Remove duplicates, rename fields |
| Intermediate Models | Aggregate and join different tables | Joins, aggregation logic |
| Mart Models | Final result layer for analytics/business | User reports, sales KPIs |

Example:

-- models/marts/sales_summary.sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    SUM(amount) AS total_sales,
    COUNT(order_id) AS order_count
FROM {{ ref('stg_orders') }}
GROUP BY 1

3.2 dbt Tests

dbt supports two testing mechanisms: Generic (built-in) tests and Custom tests.

Common test types:

| Test Type | Purpose | Example |
|---|---|---|
| unique | Validate field uniqueness | Primary key duplication check |
| not_null | Check non-null constraints | customer_id cannot be null |
| accepted_values | Validate value domain | status ∈ {active, inactive} |
| relationships | Validate foreign keys | Orders → Users |

Example:

version: 2
models:
  - name: sales_summary
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

3.3 dbt Environments

dbt uses profiles.yml to manage different deployment environments (e.g., dev, staging, prod). Each environment defines connection parameters, credentials, schema, etc., to achieve environment isolation and continuous delivery.

Typical structure:

# ~/.dbt/profiles.yml
my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: abc123.ap-southeast-2
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: DEVELOPER
      database: ANALYTICS_DEV
      warehouse: COMPUTE_WH
      schema: DEV_SCHEMA
    prod:
      type: snowflake
      account: abc123.ap-southeast-2
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: PROD_ROLE
      database: ANALYTICS_PROD
      warehouse: COMPUTE_WH
      schema: PROD_SCHEMA

In CI/CD, you can switch the target environment dynamically via environment variables, e.g.:

dbt run --target prod

3.4 Modern Data Platform Layer Model

┌───────────────────────────────┐
│  1️⃣ Data Ingestion Layer      │
│  Data Collection Layer        │
│  Source → Collect raw data    │
│  📥 Output: Raw Data          │
│  Tools: APIs, Scripts         │
└──────────────┬────────────────┘
               │
               ▼
┌───────────────────────────────┐
│  2️⃣ Data Storage Layer        │
│  Data Storage Layer           │
│  Centralize & store data      │
│  🗄️ Output: Structured /       │
│       Semi-structured zone    │
│  Tools: S3, Snowflake,        │
│         Databricks SQL        │
└──────────────┬────────────────┘
               │
               ▼
┌───────────────────────────────┐
│  3️⃣ Data Processing Layer     │
│  Data Processing Layer        │
│  Clean, transform, standardize│
│  ⚙️ Output: Cleaned Tables     │
│  Tools: dbt                   │
└──────────────┬────────────────┘
               │
               ▼
┌───────────────────────────────┐
│  4️⃣ Data Orchestration Layer  │
│  Data Orchestration (optional)│
│  Manage ETL/ELT workflows     │
│  Automate dependencies & CI/CD│
│  🧠 Output: Reliable pipelines │
│  Tools: Airflow               │
└──────────────┬────────────────┘
               │
               ▼
┌───────────────────────────────┐
│  5️⃣ Data Consumption Layer    │
│  Data Consumption Layer       │
│  Make data “usable”           │
│  📊 Output: Reports, Dashboard│
│  Tools: Power BI, Tableau     │
└───────────────────────────────┘
─────────────────────────────────────────────
⚙️ Data Governance & Security Layer
Data governance & security — spans the entire data lifecycle
─────────────────────────────────────────────
Responsibilities:
 • Metadata management (Data Lineage / Data Catalog)
 • Data access & permission control (Access Control)
 • Data quality monitoring (Data Quality)
 • Compliance & audit (GDPR / HIPAA / ISO)
Tools:
 • DataHub, Collibra, Alation, Amundsen
 • Monte Carlo, AWS Glue Data Catalog
Output:
 🔒 Trusted, Compliant, and Secure Data
─────────────────────────────────────────────

3.5 Typical Data Pipeline Directory and Execution Order

A typical data platform directory structure:

data-platform/
├─ infra/                 # Terraform scripts to create VPC, EKS, Databricks, Snowflake Warehouse; set up IAM and Secrets; provide runtime for later modules.
├─ pipelines/             # Orchestrator scripts (Airflow DAG / Dagster job) for scheduling
│  ├─ dags/
│  └─ requirements.txt
├─ dbt/                   # dbt module (runs models and data tests; failures block release)
│  ├─ models/             # SQL models (reusable SQL modules)
│  ├─ tests/              # Data quality tests
│  └─ profiles_template.yml  # Render different env vars (dev/prod) in CI
├─ jobs/                  # Spark/Databricks/PySpark scripts (data engineering logic)
├─ ops/                   # SLO quality checks and ops scripts
├─ deploy/                # Containerized deploy scripts for dbt-runner, Airflow, Databricks/Snowflake
└─ .github/workflows/     # Global CI/CD pipelines

One-sentence summary: infra lays the foundation, deploy deploys services, pipelines schedules, jobs compute, dbt models & validates, ops ensures stable running — all chained by global .github/workflows.

flowchart LR
    A[infra/ <br/>Terraform environment] --> B[deploy/ <br/>Deploy Airflow/Databricks/dbt-runner]
    B --> C[pipelines/ <br/>Airflow/Dagster orchestration]
    C --> D[jobs/ <br/>Spark/Databricks computation]
    D --> E[dbt/ <br/>SQL models & data tests]
    E --> F[ops/ <br/>SLO verification / alerting / rollback]
    F -.Results / Metrics.-> C

3.6 MLOps (Machine Learning Operations)

Why here? MLOps bridges the data platform (Section 3) and platform operations (Section 4) by enabling reliable, observable, and secure model delivery to production.

3.6.1 Overview

MLOps extends DevOps practices to machine learning systems so models can be continuously trained, deployed, monitored, and improved in production.

It integrates three domains:

| Domain | Responsibility | Example Tools |
|---|---|---|
| ML Development (Data Science) | Model design, training, experimentation | Jupyter, PyTorch, TensorFlow, Hugging Face |
| Data Engineering | Data ingestion, feature pipelines | Airflow, dbt, Spark, Databricks |
| Platform Engineering / DevOps | CI/CD, infrastructure, observability | GitHub Actions, Kubernetes, MLflow, Seldon Core |

3.6.2 MLOps Lifecycle

flowchart LR
    A[Data Collection] --> B[Feature Engineering]
    B --> C[Model Training]
    C --> D[Model Evaluation]
    D --> E[Model Registry]
    E --> F[Model Deployment]
    F --> G[Model Monitoring]
    G --> H[Feedback Loop → Retraining]
Data & Feature Engineering
  • Reproducible data pipelines; Feature Store (Feast/Databricks).
  • Data versioning via DVC/LakeFS; schema contracts to avoid breaking changes.
Model Training & Experiment Tracking
  • Track experiments with MLflow or Weights & Biases.
  • Store artifacts centrally (e.g., S3 + MLflow backend).
  • Automate training via scheduled DAGs or CI jobs.
Model Registry
  • Register model versions with metadata (owner, metrics, timestamp).
  • Examples: MLflow Registry, SageMaker Model Registry.
Model Deployment
  • Batch inference: Spark/Databricks jobs, AWS Batch.
  • Online inference: FastAPI on EKS, Lambda, SageMaker Endpoint, Seldon Core.
  • Use canary / shadow deploys for safe rollout.
Model Monitoring
  • Track prediction/concept drift, latency, and error rates.
  • Emit metrics to CloudWatch / Prometheus, traces/logs to Dynatrace.
Continuous Retraining
  • Trigger retraining on drift thresholds or SLO breaches.
  • Push new model to registry → gated deploy with approvals.
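As a sketch of what a drift check can look like, the following computes a Population Stability Index (PSI) between a reference sample and live predictions. The function name, bin count, and any alert threshold you pair it with are illustrative choices, not a standard API:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples (illustrative).

    0 means identical binned distributions; larger values mean more drift.
    A common rule of thumb treats values above ~0.2 as significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against all-equal samples

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        n = len(sample)
        # floor each fraction so empty bins don't produce log(0)
        return [max(c / n, 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A retraining trigger can then compare psi(reference, live) against the agreed threshold inside a scheduled DAG or CI job.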

3.6.3 MLOps CI/CD Example

Typical steps:

1️⃣ Git Commit → code/data versioning (Git + DVC)
2️⃣ CI → unit tests, data validation, training
3️⃣ Store artifacts → MLflow / S3
4️⃣ CD → deploy serving (API/Batch)
5️⃣ Monitor → metrics, drift detection
6️⃣ Feedback → retrain on trigger

Minimal GitHub Actions sketch:

name: mlops-ci-cd
on: [push]
jobs:
  build-train-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Train model
        run: python train.py
      - name: Register model
        # register via the MLflow Python API (mlflow.register_model);
        # the helper script name is illustrative
        run: python register_model.py
      - name: Deploy model
        run: python deploy.py --env prod

3.6.4 Integration with Platform Engineering

| Platform Component | MLOps Integration |
|---|---|
| Infrastructure Automation | Provision GPU nodes, S3 buckets, SageMaker/Databricks via Terraform |
| Data Pipeline Automation | Supply versioned, validated datasets for model training |
| Operations & Stability | Apply SLOs to ML APIs and batch jobs; monitor accuracy & drift |
| Security & Compliance | Encrypt model artifacts, restrict IAM to registries, ensure GDPR/PII compliance |


4. Platform Operations and Stability

4.1 Concepts: SLO and Observability

SLO (Service Level Objective) represents the target performance or availability expected of a service — a core metric for platform stability.

Common dimensions:

| Metric Category | Example Metric | Example Target |
|---|---|---|
| Availability | API success rate | ≥ 99.9% |
| Performance | P95 latency | < 500 ms |
| Correctness | Job success rate | ≥ 98% |
| Cost Efficiency | Resource utilization | ≥ 80% |
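An availability SLO translates directly into an error budget. A tiny helper (illustrative, not a standard library call) makes the arithmetic concrete:

```python
def error_budget_minutes(slo, period_days=30):
    """Allowed downtime per period for an availability SLO.

    Example: slo=0.999 (99.9%) over 30 days leaves
    0.001 * 30 * 24 * 60 ≈ 43.2 minutes of budget.
    """
    return (1 - slo) * period_days * 24 * 60
```

For a 99.9% SLO this yields roughly 43 minutes of allowed downtime per 30-day period; each incident spends part of that budget.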

Observability refers to the system’s ability to be observed, understood, and diagnosed. It includes three core pillars:

  • Metrics: e.g., latency, CPU usage, throughput
  • Logs: event records from systems and applications
  • Traces: cross-service request tracing (distributed call analysis)

4.2 Monitoring & Alerting — Tools and Usage

When a deployment fails or SLOs are not met, trigger alerts. For example:

Condition:
  IF error_rate > 0.5% OR latency_p95 > 800ms
  THEN trigger alarm → SNS topic → Slack channel #platform-alerts

With CloudWatch Alarm Actions, you can invoke Lambda to automatically:

  • Perform rollbacks;
  • Restart containers;
  • Notify on-call engineers;
  • File an incident ticket.
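A sketch of the alert condition with a sustained-breach window (to limit false positives, as the incident-management section suggests); the thresholds and field names are illustrative:

```python
def slo_breached(samples, err_limit=0.005, p95_limit_ms=800, window=5):
    """True only when every one of the last `window` samples violates a limit.

    Requiring a sustained breach (e.g., five consecutive minutes) filters
    out single-sample spikes that would otherwise page the on-call engineer.
    """
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough history to judge
    return all(s["error_rate"] > err_limit or s["latency_p95"] > p95_limit_ms
               for s in recent)
```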

Besides AWS CloudWatch, you can use tools like Dynatrace for deeper operational data analysis and incident investigation.

4.3 Incident Management & Optimization

| Goal | Key Practices |
|---|---|
| Detection coverage | Ensure probes exist across Infra / App / Data pipeline layers |
| False-positive control | Use multiple metrics + time windows (e.g., continuous 5 minutes) |
| Automated response | CloudWatch Alarm + Lambda + Runbook |
| Incident tracking | Auto-generate tickets / Jira records |
| Root Cause Analysis (RCA) | Use Dynatrace to auto-correlate logs and traces |

For releases, you can use canary release with partial rollbacks as needed.

4.4 Cloud Cost Governance (Cost Hygiene Principle)

  • Tagging: tag infra resources to track cost ownership;
  • Rightsizing: size resources appropriately to avoid waste;
  • Lifecycle Policies: clean up by lifecycle to prevent accumulation.

Concrete actions:

1️⃣ Tagging — Automated Detection

Goal: Ensure all EC2, S3, EBS, RDS, Lambda, etc., have the necessary tags: Project, Environment, Owner, CostCenter.

Tooling:

  • AWS Config Rules — detect missing tags.
  • AWS Tag Editor — central view and bulk tagging.
  • Custom Lambda scripts — auto-detect and notify owners.

Example (Python + boto3):

import boto3
ec2 = boto3.client('ec2')
instances = ec2.describe_instances()['Reservations']
for r in instances:
    for i in r['Instances']:
        tags = {t['Key']: t['Value'] for t in i.get('Tags', [])}
        if 'Project' not in tags or 'Owner' not in tags:
            print(f"[WARN] Instance {i['InstanceId']} missing tags: {tags}")
# Can be run periodically or as AWS Lambda + CloudWatch Event to scan daily.

2️⃣ Rightsizing (Resource Optimization)

Goal: Identify over-provisioned or long-idle resources (e.g., EC2 CPU utilization < 10% for 7 consecutive days).

Tooling:

  • AWS Compute Optimizer — official recommendation; analyzes EC2/EBS/Lambda loads and proposes adjustments.
  • AWS Trusted Advisor — cost & security recommendations (includes idle resource checks).
  • Custom CloudWatch Alarm — alert on low CPU/Mem thresholds.

Example (flag idle instances):

aws cloudwatch get-metric-statistics \
  --metric-name CPUUtilization \
  --start-time 2025-10-10T00:00:00Z \
  --end-time 2025-10-12T00:00:00Z \
  --period 86400 \
  --namespace AWS/EC2 \
  --statistics Average \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef
# → Write results to a Lambda to decide auto-shutdown.
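The decision logic such a Lambda might apply can be kept as a pure function over the returned Datapoints list; the threshold and minimum window here are illustrative:

```python
def is_idle(datapoints, cpu_threshold=10.0, min_days=7):
    """Decide whether an instance looks idle.

    `datapoints` is the Datapoints list returned by CloudWatch's
    get-metric-statistics with a period of 86400 s (one entry per day).
    The instance is flagged only if every daily CPUUtilization average
    over at least `min_days` days sits below the threshold.
    """
    if len(datapoints) < min_days:
        return False  # not enough history to judge
    return all(dp["Average"] < cpu_threshold for dp in datapoints)
```

Keeping the rule pure makes it easy to unit-test before wiring it to an auto-shutdown action.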

3️⃣ Lifecycle Policies (Periodic Cleanup)

Goal: Regularly clean stale data, logs, snapshots, and temp resources.

Tooling:

  • S3 Lifecycle Policy — auto-transition/delete old objects.
  • EBS Snapshot Lifecycle Manager (DLM) — auto-delete expired snapshots.
  • Terraform Lifecycle Rules — control destroy/keep logic.
  • Scheduled Lambda cleanup — remove untagged temp resources (e.g., “test”, “tmp”).

Example (S3 lifecycle policy):

{
  "Rules": [
    {
      "ID": "DeleteOldLogs",
      "Prefix": "logs/",
      "Status": "Enabled",
      "Expiration": {"Days": 30}
    }
  ]
}

Past practice

In a previous company, even with ample resources available, I still requested them conservatively, following:

  • Apply on Demand: request resources only when truly needed to avoid idling;
  • Reuse Before Apply: prefer reusing existing environments/instances;
  • Utilization-informed strategy: summarize utilization from the last deployment to guide the next request.

5. Security & Compliance

5.1 IAM and Least-Privilege Access

5.1.1 IAM (Identity and Access Management)

IAM (Identity and Access Management) is AWS’s identity and access control system. Its core goal is:

“Ensure the right people and services, at the right time, can access only the right resources.”

In other words, IAM determines:

  • Who (Role) can access;
  • What (Resource) can be accessed;
  • Under what conditions (Condition);
  • Which actions (Action) are permitted.

5.1.2 Policy Example

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::temporary-upload-bucket/*",
      "Condition": {
        "DateLessThan": {
          "aws:CurrentTime": "2025-10-13T00:00:00Z"
        }
      }
    }
  ]
}

5.1.3 RBAC Model & Least Privilege

AWS IAM follows a role-based access control (RBAC) model, where each policy must be attached to an IAM entity such as an IAM account, IAM user group, IAM user, or IAM role.
An IAM Role is a temporary identity that can be assumed by AWS services, users, or external systems to obtain specific permissions for a limited time.
The Least Privilege Principle ensures that each IAM entity has only the minimal permissions required to perform its tasks.

5.1.4 Policy Layers

| Policy Type | Scope | Typical Scenarios | Characteristics |
|---|---|---|---|
| Account Policy (global) | Affects all users/resources under an AWS account or organization | Unified security baseline / global restrictions (e.g., disallow deleting CloudTrail or disabling billing monitoring) | Implemented via Service Control Policies (SCPs) in AWS Organizations by admins |
| Group Policy (team-level) | Applied to a user group; uniform permissions for members | Grant the Dev group "S3 read/write + EC2 start/stop"; the Finance group "view billing" | Facilitates team-wide authorization; members inherit automatically |
| Role Policy (service-level) | Granted to services (Lambda/EC2/EKS Pod/CI-CD tools) or for cross-account access | Lambda accessing DynamoDB, EC2 pushing logs to S3, GitHub Actions deploying to AWS | For machine identities / temporary access (AssumeRole); follow least privilege |
| Inline Policy (exception-level) | Embedded directly in a specific user/group/role | Temporary emergency grants, short-term tests/special tasks (e.g., open S3 upload for one day) | Shares the lifecycle of the bound entity; removed on deletion; not reusable |

In one sentence: Account Policies control what the entire account can do; Group Policies control what groups of people can do; Role Policies control what a service/system can do; Inline Policies are for short-term exceptions or temporary grants.

5.1.5 ARN Components

ARN (Amazon Resource Name) is the unique identifier for an AWS resource.

| Part | Meaning | Example |
|---|---|---|
| arn | Fixed prefix indicating an ARN | arn |
| partition | AWS partition (usually aws) | aws (global) / aws-cn (China) / aws-us-gov (GovCloud) |
| service | Service name | s3, ec2, iam, lambda, dynamodb |
| region | Region code | ap-southeast-2 (Sydney) |
| account-id | AWS account ID (12 digits) | 123456789012 |
| resource | Specific resource name/path | bucket-name, instance/i-12345, role/MyLambdaRole |

5.1.6 ARN Examples

| Service | Example ARN | Description |
|---|---|---|
| S3 Bucket | arn:aws:s3:::my-data-bucket | Entire bucket |
| S3 Object | arn:aws:s3:::my-data-bucket/images/photo.jpg | Specific object in bucket |
| EC2 Instance | arn:aws:ec2:ap-southeast-2:123456789012:instance/i-0abc1234def567890 | A VM instance |
| Lambda Function | arn:aws:lambda:ap-southeast-2:123456789012:function:MyFunction | A function |
| IAM Role | arn:aws:iam::123456789012:role/MyLambdaRole | An IAM role |
| DynamoDB Table | arn:aws:dynamodb:ap-southeast-2:123456789012:table/Users | A table |
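Because the first five fields are ':'-separated, ARNs are easy to take apart programmatically. A small parser sketch (illustrative, not an AWS SDK API):

```python
def parse_arn(arn):
    """Split an ARN into its documented parts.

    Only the first five ':' separators delimit fields; everything after the
    fifth is the resource part, which may itself contain ':' or '/'
    (e.g. 'function:MyFunction').
    """
    prefix, partition, service, region, account_id, resource = arn.split(":", 5)
    if prefix != "arn":
        raise ValueError(f"not an ARN: {arn}")
    return {
        "partition": partition,
        "service": service,
        "region": region,          # empty for global services (e.g., IAM, S3)
        "account_id": account_id,  # empty where AWS omits it (e.g., S3)
        "resource": resource,
    }
```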

5.1.7 ARN in Policies

In the Resource field of an IAM policy, the ARN specifies the target resource to allow/deny.

{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::my-data-bucket/*"
}

Meaning: allow reading all objects under the my-data-bucket S3 bucket.

5.1.8 Wildcards in ARNs

Sometimes you need to match a class of resources using *:

"Resource": "arn:aws:lambda:ap-southeast-2:123456789012:function:*"

This means: allow access to all Lambda functions in the current account within that region.
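A simplified sketch of how such a wildcard pattern matches a concrete ARN; real IAM evaluation has additional rules (policy variables, explicit denies, etc.), but for Resource patterns `*` matches any run of characters, which glob-style matching reproduces:

```python
import fnmatch

def resource_matches(pattern, arn):
    """Simplified check that a policy Resource pattern covers an ARN.

    Uses case-sensitive glob matching: '*' matches any characters,
    including ':' and '/'. This is an approximation of IAM's behavior,
    not a full policy evaluator.
    """
    return fnmatch.fnmatchcase(arn, pattern)
```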

5.1.9 Policy Reviews

  • ✅ Use IAM Access Analyzer to detect over-privileged permissions
  • ✅ Use CloudTrail to audit all IAM change operations

5.1.10 Past Practice

  • In previous internships, the initial approach was direct access from a local program to AWS Bedrock, which made approval difficult.
  • Later, calling Bedrock via AWS Lambda reduced permission requirements and achieved the goal.

5.2 Secrets Management

5.2.1 Management Method

AWS Secrets Manager + AWS KMS (Key Management Service)

5.2.2 Secret Retrieval Flow

sequenceDiagram
    participant App as Application
    participant SM as AWS Secrets Manager
    participant KMS as AWS KMS
    participant DB as Database
    App->>SM: Request secret (GetSecretValue)
    SM->>KMS: Decrypt stored secret
    KMS-->>SM: Return plaintext credentials
    SM-->>App: Return credentials (JSON)
    App->>DB: Use credentials to establish connection

Note: Secrets Manager can enable automatic rotation, generating a Rotation Lambda Function to periodically create new passwords.

5.2.3 Secrets Manager Storage Structure (Example)

{
  "username": "admin_user",
  "password": "N9pLxkK4Q7z!",
  "engine": "mysql",
  "host": "db-prod.cluster-xxxx.ap-southeast-2.rds.amazonaws.com",
  "port": 3306,
  "dbname": "production"
}

5.2.4 Access Control for Secrets (IAM + ARN)

Each secret has a unique ARN, for example:
arn:aws:secretsmanager:ap-southeast-2:123456789012:secret:prod/db-password-AbCdE

To access that db-password, a user or service must have a policy like:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "secretsmanager:GetSecretValue"
    ],
    "Resource": "arn:aws:secretsmanager:ap-southeast-2:123456789012:secret:prod/db-password-AbCdE"
  }]
}

5.2.5 Past Practice

  • GitHub Secrets;
  • Use of self-hosted enterprise-grade KMS.

5.3 Encryption in Transit / At Rest

5.3.1 Past Practice

  • Frontend usernames/passwords MD5-hashed before transmission; Flask added a salt; the resulting digest was stored in the DB;
  • TLS/SSL (HTTPS) to encrypt site access;
  • Encrypted protocols between services (e.g., Remote Desktop, vmess, ssr);
  • Application layer: VPN.

5.4 Network Controls

5.4.1 Past Practice

  • Subnets: initially designed subnets + bastion for the alcohol auditor;
  • Security Groups: when using OCI, configured accessible IPs and traffic ports;
  • Linux firewall: for personal sites, only opened necessary service ports; others closed;
  • Microservice traffic control: used Spring Cloud Gateway and Alibaba Sentinel for rate limiting.

5.5 Vulnerability Remediation

5.5.1 Past Practice

  • Package scanning + manual review;
  • Monitor official security bulletins (e.g., CNNVD), rate risks, and report;
  • Prepare upgrade plans: set explicit upgrade dates and email stakeholders.

5.6 Audit Support

5.6.1 Past Practice

  • At ICBC: recorded vulnerability remediation, patch updates, and risk closure;
  • Submitted infrastructure & configuration change history (CSV files) for audit.