Platform Engineering — Quick Notes

In my opinion, a Platform Engineer has three major responsibilities:

  1. Automation (CI/CD Workflow): basic application automation, infrastructure automation, data pipeline automation, AI deployment automation.
  2. Platform Operation Stability: monitoring and alerting, quality control (SLOs), rollback strategy, cost hygiene.
  3. Access Control, Security, and Compliance: least-privilege IAM practice, secrets management, encryption, network access control, audit support, vulnerability remediation.

This document reorganizes the notes into structured sections for clarity and readability.

Note:

I am already familiar with basic application automation and infrastructure automation.

I was less familiar with data pipeline automation but studied it over the weekends.
Correct me if I'm wrong!


1. Basic Application Automation — App CI/CD Workflow

flowchart LR
    %% === Developer & Git Commit ===
    A[🧑‍💻 Developer] -->|Pre-commit checks| A1[🔍 Lint / Local Test / Format]
    A -->|commit| GH[(GitHub)]
    %% === Source Stage ===
    GH --> S[📁 Source]
    S --> S1[Branch Protection]
    S1 --> S2[Linting]
    %% === Build Stage ===
    S --> B[⚙️ Build]
    B --> B1[Building Image / Compiling Code]
    B1 --> B2[🐳 Container Image]
    B2 --> B3[🧪 Unit Tests]
    B3 --> B4[📊 Code Coverage 80–90%]
    %% === Test Stage ===
    B --> T[🧫 Test]
    T --> T1[Integration Test]
    %% === Release Stage ===
    T --> R[🚀 Release]
    R --> R1[Ship Image to Registry]
    R1 --> REG[(📦 Registry)]
    %% === Style Definition ===
    classDef main fill:#004B8D,stroke:#333,stroke-width:1px,color:#fff,font-size:12px;
    class S,B,T,R main;
    linkStyle default stroke:#999,stroke-width:1.2px;

2. Infrastructure Automation

2.1 Automation Approaches

There are two ways to implement infrastructure automation: a visual drag-and-drop orchestration approach (e.g., AWS Step Functions, Azure Logic Apps) and an Infrastructure as Code (IaC) approach using Terraform.

In general, during the exploratory development phase, you can prioritize the drag-and-drop approach to quickly assemble, validate logic, and adjust dependencies; when workflows and architecture have stabilized, use Terraform for code-based management.

Before doing infrastructure automation, you need a basic understanding of commonly used AWS services and their attributes. For example:

  • When creating an EC2 instance, pay attention to attributes such as subnet, instance type (CPU/memory), security group, and key pair.
  • When creating an S3 bucket, focus on access control (ACL/Policy), encryption, versioning, and lifecycle rules.

In addition, whether it is EC2, S3, or other resources, when creating them you must configure permissions (for access control) and tagging (for cost hygiene). This will be discussed in detail in subsequent chapters.
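As a small illustration of the tagging point, a helper can centralize the required tag keys used throughout these notes (Project, Environment, Owner, CostCenter); the helper names below are hypothetical:

```python
# Illustrative sketch: one place that defines the tag set every resource
# we create must carry. Adjust the keys to your organization's convention.
REQUIRED_TAG_KEYS = ("Project", "Environment", "Owner", "CostCenter")

def standard_tags(project, environment, owner, cost_center):
    """Return tags in the list-of-dicts shape boto3's EC2/RDS APIs expect."""
    values = {
        "Project": project,
        "Environment": environment,
        "Owner": owner,
        "CostCenter": cost_center,
    }
    return [{"Key": k, "Value": values[k]} for k in REQUIRED_TAG_KEYS]

def missing_tags(tag_list):
    """Which required keys are absent from an existing resource's tags."""
    present = {t["Key"] for t in tag_list}
    return [k for k in REQUIRED_TAG_KEYS if k not in present]
```

The same list can then be passed, for example, as the Tags in a boto3 TagSpecifications argument, and missing_tags reused by the tag-scanning script in the cost-hygiene section.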

2.2 AWS Service Attribute Quick Reference

| Layer | Service | Main Role | Description / Dependency Logic | Frequently Adjusted Attributes |
|---|---|---|---|---|
| 1. Network Foundation Layer | VPC (Virtual Private Cloud) | Define network boundaries (CIDR, DNS, subnets) | Foundation for all resources; must be created first | cidr_block, enable_dns_hostnames, enable_dns_support, tags |
| | Subnet | Network segmentation (public/private) | Depends on VPC; determines whether resources can access the public Internet | vpc_id, cidr_block, availability_zone, map_public_ip_on_launch, tags |
| | Security Group | Network access control | Depends on VPC; controls EC2, RDS, ECS traffic | ingress, egress, description, tags |
| | Route 53 | DNS resolution | Can connect to CloudFront, ALB, or API Gateway | records, ttl, alias, health_check_id |
| 2. Storage & Encryption Layer | S3 (Simple Storage Service) | Object storage | Can be used for static websites, logs, backup sources | acl, versioning, lifecycle_rule, public_access_block, server_side_encryption_configuration, tags |
| | EBS (Elastic Block Store) | EC2 block storage | Attached to EC2 instances | size, type, iops, encrypted, kms_key_id, tags |
| | EFS (Elastic File System) | Shared file system | Mountable to multiple EC2 / ECS | performance_mode, throughput_mode, lifecycle_policy, encrypted, mount_targets |
| | KMS (Key Management Service) | Key management & encryption | Used by S3 / EBS / RDS / CloudTrail | enable_key_rotation, policy, deletion_window_in_days, tags |
| 3. Identity & Access Layer | IAM Role / Policy | Permissions & access control | Required by all services (EC2, Lambda, Terraform) | assume_role_policy, inline_policies, managed_policy_arns, tags |
| 4. Compute & Container Layer | EC2 (Elastic Compute Cloud) | Virtual machine instances | Depends on VPC, Subnet, SG, IAM, EBS | ami, instance_type, key_name, subnet_id, vpc_security_group_ids, user_data, tags |
| | ECS / EKS | Container orchestration | Depends on EC2 / Fargate, VPC, SG, IAM | cluster_name, task_definition, service, network_configuration, execution_role_arn |
| | Lambda | Serverless functions | Depends on IAM Role, VPC (optional) | runtime, handler, timeout, memory_size, environment, vpc_config, source_code_hash, tags |
| 5. Data & Messaging Layer | RDS (Relational Database Service) | Managed database | Depends on VPC, Subnet, SG, KMS (encryption) | engine_version, instance_class, storage_type, multi_az, backup_retention_period, vpc_security_group_ids, tags |
| | DynamoDB (NoSQL Database) | Key-value / document database | Serverless NoSQL; often used by Lambda, Step Functions, Terraform backend locking | hash_key, range_key, billing_mode, read_capacity, write_capacity, stream_enabled, ttl, tags |
| | SQS (Simple Queue Service) | Queue-based communication | Asynchronous decoupling for Lambda / ECS | delay_seconds, message_retention_seconds, visibility_timeout_seconds, fifo_queue, tags |
| | SNS (Simple Notification Service) | Message notifications | Works with CloudWatch / Lambda / SQS | topic_name, subscription, delivery_policy, tags |
| 6. Network & Distribution Layer | Load Balancer (ALB/NLB) | Traffic distribution | Depends on Subnet, SG; connects to EC2 / ECS | listener, target_group, health_check, subnets, security_groups, tags |
| | CloudFront (CDN) | Global content distribution | Depends on S3 / ALB / ACM certificates | origin, default_cache_behavior, viewer_certificate, aliases, enabled, tags |
| 7. Monitoring & Logging Layer | CloudWatch | Monitoring & alerting | Depends on SNS; collects logs from Lambda / ECS / EC2 | metric_alarm, threshold, retention_in_days, comparison_operator, sns_topic_arn |
| | CloudTrail | Audit logs | Depends on S3, KMS; tracks API operations | is_multi_region_trail, s3_bucket_name, enable_log_file_validation, cloud_watch_logs_group_arn, tags |
| 8. IaC & Automation Layer | CloudFormation / Terraform Backend | Infrastructure as Code (IaC) | Depends on S3 (state storage) + DynamoDB (locking) | bucket, dynamodb_table, region, encrypt, kms_key_id |
| 9. Image & Artifact Layer | ECR (Elastic Container Registry) | Image registry | Used by ECS / EKS to pull images | image_scanning_configuration, encryption_configuration, lifecycle_policy, repository_name, tags |

2.3 Terraform Basics

2.3.1 Overview

Terraform is an Infrastructure as Code (IaC) tool that acts as a translator between your configuration files (written in HashiCorp Configuration Language, HCL) and the APIs of various cloud providers (like AWS, Azure, GCP, OCI). Terraform configurations are made up of several key components:

| Concept | Description |
|---|---|
| Provider | A plugin provided by cloud vendors (e.g., AWS, OCI, Azure) that allows Terraform to communicate with their APIs — essentially the translator between Terraform and the cloud. |
| Resource | The actual cloud objects defined in Terraform, such as compute instances, storage buckets, or databases. |
| Output | Used to display or pass results after Terraform execution — for example, the public IP address of an instance or a database endpoint. |

2.3.2 Example: Resource Definition (OCI Instance)

resource "oci_core_instance" "arm_instance" {
  availability_domain = var.availability_domain
  compartment_id      = var.tenancy_ocid  # Free-tier users usually have only the root compartment, whose ID is equal to tenancy OCID
  shape               = var.instance_shape

  shape_config {
    ocpus         = var.ocpus
    memory_in_gbs = var.memory_in_gbs
  }

  create_vnic_details {
    subnet_id        = var.subnet_id
    assign_public_ip = true
  }

  source_details {
    source_type             = var.instance_source_type
    source_id               = var.source_id
    boot_volume_size_in_gbs = var.boot_volume_size_in_gbs
  }

  metadata = {
    ssh_authorized_keys = var.ssh_authorized_keys
  }
}

2.3.3 Terraform Execution Workflow

A typical Terraform workflow consists of the following steps:

terraform init 
-> terraform plan
-> terraform apply
   ├─ ① Retrieve current state (from local or backend)
   ├─ ② Lock the state (state locking)
   ├─ ③ Apply changes (create / update / destroy)
   ├─ ④ Write updated state file
   └─ ⑤ Unlock state

In collaborative or production environments, it is strongly recommended to store Terraform state in a remote backend with state locking enabled. A commonly recommended combination on AWS is:

  • S3 backend → stores the Terraform state file (state)
  • DynamoDB → provides locking to prevent concurrent modifications
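A minimal backend block wiring these two together might look like the following (bucket, table, and key names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"    # placeholder bucket name
    key            = "prod/terraform.tfstate"
    region         = "ap-southeast-2"
    encrypt        = true                    # server-side encryption of the state file
    dynamodb_table = "terraform-state-lock"  # placeholder lock table (hash key: LockID)
  }
}
```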

2.3.4 Terraform Modularization

root/
├── main.tf                # Root module (invokes submodules)
├── variables.tf
├── outputs.tf
└── modules/
    └── ec2_instance/      # Submodule
        ├── main.tf
        ├── variables.tf
        └── outputs.tf
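The root main.tf invokes the submodule roughly like this (module name, variables, and outputs are illustrative):

```hcl
# root main.tf — invoke the submodule (illustrative names)
module "ec2_instance" {
  source        = "./modules/ec2_instance"
  instance_type = "t3.micro"
  subnet_id     = var.subnet_id
}

# surface the submodule's output at the root level
output "instance_id" {
  value = module.ec2_instance.instance_id
}
```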

2.3.5 EKS Cluster and Infrastructure Automation

The creation of an EKS cluster is also part of infrastructure automation. In an EKS environment, Terraform is typically responsible for:

  • Setting up the network environment for EKS (VPC, subnets, security groups);
  • Creating Node Group instances for running Pods;
  • Managing IAM roles and policies;
  • Configuring monitoring components (e.g., CloudWatch).

After deployment, what you get is essentially an empty Kubernetes cluster — Terraform provisions the environment, but it does not control what runs inside the cluster.

Kubernetes, on the other hand, uses YAML manifests to define what runs within the cluster:

  • Which application (container image) to run
  • How many replicas to deploy
  • Which ports to expose and monitor
  • Which configuration or storage resources to attach

These application-level details are outside Terraform’s scope.
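For illustration, a minimal Deployment manifest covering those four concerns might look like this (names, image URI, port, and ConfigMap are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                  # placeholder application name
spec:
  replicas: 3                     # how many replicas to deploy
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: 123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/demo:1.0  # which image to run
          ports:
            - containerPort: 8080  # which port to expose and monitor
          envFrom:
            - configMapRef:
                name: demo-config  # which configuration to attach
```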

Note: Terraform can also indirectly deploy containerized apps by invoking the Kubernetes API (similar to kubectl apply), but for clarity and maintainability, avoid using Terraform for application deployment in the infrastructure initialization stage.


3. Data Pipeline Automation

Why data pipeline automation is needed: Data pipeline automation makes “data supply” as repeatable, observable, and rollback-able as application delivery. It uses engineering practices to ensure stable, fast, and low-cost delivery of reliable data to BI/AI applications.

  • What is Snowflake: originally focused on the data warehouse; supports SQL and Python; delivered as SaaS.
  • What is Databricks: originally focused on the data lake and ML; delivered as PaaS.
  • Snowflake and Databricks: both can be regarded as data warehouses in this context.
  • What is dbt: a framework that sits on top of the data warehouse. It turns warehouse data into usable data products via models, tests, and environments. Its run/test commands integrate well with CI/CD: orchestrators (e.g., Airflow) or global CI/CD pipelines call dbt run to transform models into cleaned and modeled tables/views, and dbt test to enforce data quality.

3.1 dbt Models

Each dbt model (.sql file) is essentially executable SQL + a Jinja templating layer. A model defines the full mapping from upstream sources → transformation logic → target tables/views.

Common model types:

| Type | Description | Examples |
|---|---|---|
| Staging Models | Light cleaning and standardization of raw data | Remove duplicates, rename fields |
| Intermediate Models | Aggregate and join different tables | Joins, aggregation logic |
| Mart Models | Final result layer for analytics/business | User reports, sales KPIs |

Example:

-- models/marts/sales_summary.sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    SUM(amount) AS total_sales,
    COUNT(order_id) AS order_count
FROM {{ ref('stg_orders') }}
GROUP BY 1

3.2 dbt Tests

dbt supports two testing mechanisms: Generic (built-in) tests and Custom tests.

Common test types:

| Test Type | Purpose | Example |
|---|---|---|
| unique | Validate field uniqueness | Primary key duplication check |
| not_null | Check non-null constraints | customer_id cannot be null |
| accepted_values | Validate value domain | status ∈ {active, inactive} |
| relationships | Validate foreign keys | Orders → Users |

Example:

version: 2
models:
  - name: sales_summary
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

3.3 dbt Environments

dbt uses profiles.yml to manage different deployment environments (e.g., dev, staging, prod). Each environment defines connection parameters, credentials, schema, etc., to achieve environment isolation and continuous delivery.

Typical structure:

# ~/.dbt/profiles.yml
my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: abc123.ap-southeast-2
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: DEVELOPER
      database: ANALYTICS_DEV
      warehouse: COMPUTE_WH
      schema: DEV_SCHEMA
    prod:
      type: snowflake
      account: abc123.ap-southeast-2
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: PROD_ROLE
      database: ANALYTICS_PROD
      warehouse: COMPUTE_WH
      schema: PROD_SCHEMA

In CI/CD, you can switch the target environment dynamically via environment variables, e.g.:

dbt run --target prod

3.4 Modern Data Platform Layer Model

┌───────────────────────────────┐
│  1️⃣ Data Ingestion Layer      │
│  Data Collection Layer        │
│  Source → Collect raw data    │
│  📥 Output: Raw Data          │
│  Tools: APIs, Scripts         │
└──────────────┬────────────────┘
               │
               ▼
┌───────────────────────────────┐
│  2️⃣ Data Storage Layer        │
│  Data Storage Layer           │
│  Centralize & store data      │
│  🗄️ Output: Structured /       │
│       Semi-structured zone    │
│  Tools: S3, Snowflake,        │
│         Databricks SQL        │
└──────────────┬────────────────┘
               │
               ▼
┌───────────────────────────────┐
│  3️⃣ Data Processing Layer     │
│  Data Processing Layer        │
│  Clean, transform, standardize│
│  ⚙️ Output: Cleaned Tables     │
│  Tools: dbt                   │
└──────────────┬────────────────┘
               │
               ▼
┌───────────────────────────────┐
│  4️⃣ Data Orchestration Layer  │
│  Data Orchestration (optional)│
│  Manage ETL/ELT workflows     │
│  Automate dependencies & CI/CD│
│  🧠 Output: Reliable pipelines │
│  Tools: Airflow               │
└──────────────┬────────────────┘
               │
               ▼
┌───────────────────────────────┐
│  5️⃣ Data Consumption Layer    │
│  Data Consumption Layer       │
│  Make data “usable”           │
│  📊 Output: Reports, Dashboard│
│  Tools: Power BI, Tableau     │
└───────────────────────────────┘
─────────────────────────────────────────────
⚙️ Data Governance & Security Layer
Data governance & security — spans the entire data lifecycle
─────────────────────────────────────────────
Responsibilities:
 • Metadata management (Data Lineage / Data Catalog)
 • Data access & permission control (Access Control)
 • Data quality monitoring (Data Quality)
 • Compliance & audit (GDPR / HIPAA / ISO)
Tools:
 • DataHub, Collibra, Alation, Amundsen
 • Monte Carlo, AWS Glue Data Catalog
Output:
 🔒 Trusted, Compliant, and Secure Data
─────────────────────────────────────────────

3.5 Typical Data Pipeline Directory and Execution Order

A typical data platform directory structure:

data-platform/
├─ infra/                 # Terraform scripts to create VPC, EKS, Databricks, Snowflake Warehouse; set up IAM and Secrets; provide runtime for later modules.
├─ pipelines/             # Orchestrator scripts (Airflow DAG / Dagster job) for scheduling
│  ├─ dags/
│  └─ requirements.txt
├─ dbt/                   # dbt module (runs models and data tests; failures block release)
│  ├─ models/             # SQL models (reusable SQL modules)
│  ├─ tests/              # Data quality tests
│  └─ profiles_template.yml  # Render different env vars (dev/prod) in CI
├─ jobs/                  # Spark/Databricks/PySpark scripts (data engineering logic)
├─ ops/                   # SLO quality checks and ops scripts
├─ deploy/                # Containerized deploy scripts for dbt-runner, Airflow, Databricks/Snowflake
└─ .github/workflows/     # Global CI/CD pipelines

One-sentence summary: infra lays the foundation, deploy deploys services, pipelines schedules, jobs compute, dbt models & validates, ops ensures stable running — all chained by global .github/workflows.

flowchart LR
    A[infra/ <br/>Terraform environment] --> B[deploy/ <br/>Deploy Airflow/Databricks/dbt-runner]
    B --> C[pipelines/ <br/>Airflow/Dagster orchestration]
    C --> D[jobs/ <br/>Spark/Databricks computation]
    D --> E[dbt/ <br/>SQL models & data tests]
    E --> F[ops/ <br/>SLO verification / alerting / rollback]
    F -.Results / Metrics.-> C

3.6 MLOps (Machine Learning Operations)

Why here? MLOps bridges the data platform (Section 3) and platform operations (Section 4) by enabling reliable, observable, and secure model delivery to production.

3.6.1 Overview

MLOps extends DevOps practices to machine learning systems so models can be continuously trained, deployed, monitored, and improved in production.

It integrates three domains:

| Domain | Responsibility | Example Tools |
|---|---|---|
| ML Development (Data Science) | Model design, training, experimentation | Jupyter, PyTorch, TensorFlow, Hugging Face |
| Data Engineering | Data ingestion, feature pipelines | Airflow, dbt, Spark, Databricks |
| Platform Engineering / DevOps | CI/CD, infrastructure, observability | GitHub Actions, Kubernetes, MLflow, Seldon Core |

3.6.2 MLOps Lifecycle

flowchart LR
    A[Data Collection] --> B[Feature Engineering]
    B --> C[Model Training]
    C --> D[Model Evaluation]
    D --> E[Model Registry]
    E --> F[Model Deployment]
    F --> G[Model Monitoring]
    G --> H[Feedback Loop → Retraining]
Data & Feature Engineering
  • Reproducible data pipelines; Feature Store (Feast/Databricks).
  • Data versioning via DVC/LakeFS; schema contracts to avoid breaking changes.
Model Training & Experiment Tracking
  • Track experiments with MLflow or Weights & Biases.
  • Store artifacts centrally (e.g., S3 + MLflow backend).
  • Automate training via scheduled DAGs or CI jobs.
Model Registry
  • Register model versions with metadata (owner, metrics, timestamp).
  • Examples: MLflow Registry, SageMaker Model Registry.
Model Deployment
  • Batch inference: Spark/Databricks jobs, AWS Batch.
  • Online inference: FastAPI on EKS, Lambda, SageMaker Endpoint, Seldon Core.
  • Use canary / shadow deploys for safe rollout.
Model Monitoring
  • Track prediction/concept drift, latency, and error rates.
  • Emit metrics to CloudWatch / Prometheus, traces/logs to Dynatrace.
Continuous Retraining
  • Trigger retraining on drift thresholds or SLO breaches.
  • Push new model to registry → gated deploy with approvals.
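As a sketch of what a drift check can look like, the following computes a Population Stability Index (PSI) between a reference sample and live predictions. The function name, bin count, and any alert threshold you pair it with are illustrative choices, not a standard API:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples (illustrative).

    0 means identical binned distributions; larger values mean more drift.
    A common rule of thumb treats values above ~0.2 as significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against all-equal samples

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        n = len(sample)
        # floor each fraction so empty bins don't produce log(0)
        return [max(c / n, 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A retraining trigger can then compare psi(reference, live) against the agreed threshold inside a scheduled DAG or CI job.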

3.6.3 MLOps CI/CD Example

Typical steps:

1️⃣ Git Commit → code/data versioning (Git + DVC)
2️⃣ CI → unit tests, data validation, training
3️⃣ Store artifacts → MLflow / S3
4️⃣ CD → deploy serving (API/Batch)
5️⃣ Monitor → metrics, drift detection
6️⃣ Feedback → retrain on trigger

Minimal GitHub Actions sketch:

name: mlops-ci-cd
on: [push]
jobs:
  build-train-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Train model
        run: python train.py
      - name: Register model
        # register via the MLflow Python API (mlflow.register_model);
        # the helper script name is illustrative
        run: python register_model.py
      - name: Deploy model
        run: python deploy.py --env prod

3.6.4 Integration with Platform Engineering

| Platform Component | MLOps Integration |
|---|---|
| Infrastructure Automation | Provision GPU nodes, S3 buckets, SageMaker/Databricks via Terraform |
| Data Pipeline Automation | Supply versioned, validated datasets for model training |
| Operations & Stability | Apply SLOs to ML APIs and batch jobs; monitor accuracy & drift |
| Security & Compliance | Encrypt model artifacts, restrict IAM to registries, ensure GDPR/PII compliance |


4. Platform Operations and Stability

4.1 Concepts: SLO and Observability

SLO (Service Level Objective) represents the target performance or availability expected of a service — a core metric for platform stability.

Common dimensions:

| Metric Category | Example Metric | Example Target |
|---|---|---|
| Availability | API success rate | ≥ 99.9% |
| Performance | P95 latency | < 500 ms |
| Correctness | Job success rate | ≥ 98% |
| Cost Efficiency | Resource utilization | ≥ 80% |
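An availability SLO translates directly into an error budget. A tiny helper (illustrative, not a standard library call) makes the arithmetic concrete:

```python
def error_budget_minutes(slo, period_days=30):
    """Allowed downtime per period for an availability SLO.

    Example: slo=0.999 (99.9%) over 30 days leaves
    0.001 * 30 * 24 * 60 ≈ 43.2 minutes of budget.
    """
    return (1 - slo) * period_days * 24 * 60
```

For a 99.9% SLO this yields roughly 43 minutes of allowed downtime per 30-day period; each incident spends part of that budget.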

Observability refers to the system’s ability to be observed, understood, and diagnosed. It includes three core pillars:

  • Metrics: e.g., latency, CPU usage, throughput
  • Logs: event records from systems and applications
  • Traces: cross-service request tracing (distributed call analysis)

4.2 Monitoring & Alerting — Tools and Usage

When a deployment fails or SLOs are not met, trigger alerts. For example:

Condition:
  IF error_rate > 0.5% OR latency_p95 > 800ms
  THEN trigger alarm → SNS topic → Slack channel #platform-alerts

With CloudWatch Alarm Actions, you can invoke Lambda to automatically:

  • Perform rollbacks;
  • Restart containers;
  • Notify on-call engineers;
  • File an incident ticket.
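A sketch of the alert condition with a sustained-breach window (to limit false positives, as the incident-management section suggests); the thresholds and field names are illustrative:

```python
def slo_breached(samples, err_limit=0.005, p95_limit_ms=800, window=5):
    """True only when every one of the last `window` samples violates a limit.

    Requiring a sustained breach (e.g., five consecutive minutes) filters
    out single-sample spikes that would otherwise page the on-call engineer.
    """
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough history to judge
    return all(s["error_rate"] > err_limit or s["latency_p95"] > p95_limit_ms
               for s in recent)
```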

Besides AWS CloudWatch, you can use tools like Dynatrace for deeper operational data analysis and incident investigation.

4.3 Incident Management & Optimization

| Goal | Key Practices |
|---|---|
| Detection coverage | Ensure probes exist across Infra / App / Data pipeline layers |
| False-positive control | Use multiple metrics + time windows (e.g., continuous 5 minutes) |
| Automated response | CloudWatch Alarm + Lambda + Runbook |
| Incident tracking | Auto-generate tickets / Jira records |
| Root Cause Analysis (RCA) | Use Dynatrace to auto-correlate logs and traces |

For releases, you can use canary release with partial rollbacks as needed.

4.4 Cloud Cost Governance (Cost Hygiene Principle)

  • Tagging: tag infra resources to track cost ownership;
  • Rightsizing: size resources appropriately to avoid waste;
  • Lifecycle Policies: clean up by lifecycle to prevent accumulation.

Concrete actions:

1️⃣ Tagging — Automated Detection

Goal: Ensure all EC2, S3, EBS, RDS, Lambda, etc., have the necessary tags: Project, Environment, Owner, CostCenter.

Tooling:

  • AWS Config Rules — detect missing tags.
  • AWS Tag Editor — central view and bulk tagging.
  • Custom Lambda scripts — auto-detect and notify owners.

Example (Python + boto3):

import boto3
ec2 = boto3.client('ec2')
instances = ec2.describe_instances()['Reservations']
for r in instances:
    for i in r['Instances']:
        tags = {t['Key']: t['Value'] for t in i.get('Tags', [])}
        if 'Project' not in tags or 'Owner' not in tags:
            print(f"[WARN] Instance {i['InstanceId']} missing tags: {tags}")
# Can be run periodically or as AWS Lambda + CloudWatch Event to scan daily.

2️⃣ Rightsizing (Resource Optimization)

Goal: Identify over-provisioned or long-idle resources (e.g., EC2 CPU utilization < 10% for 7 consecutive days).

Tooling:

  • AWS Compute Optimizer — official recommendation; analyzes EC2/EBS/Lambda loads and proposes adjustments.
  • AWS Trusted Advisor — cost & security recommendations (includes idle resource checks).
  • Custom CloudWatch Alarm — alert on low CPU/Mem thresholds.

Example (flag idle instances):

aws cloudwatch get-metric-statistics \
  --metric-name CPUUtilization \
  --start-time 2025-10-10T00:00:00Z \
  --end-time 2025-10-12T00:00:00Z \
  --period 86400 \
  --namespace AWS/EC2 \
  --statistics Average \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef
# → Write results to a Lambda to decide auto-shutdown.
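The decision logic such a Lambda might apply can be kept as a pure function over the returned Datapoints list; the threshold and minimum window here are illustrative:

```python
def is_idle(datapoints, cpu_threshold=10.0, min_days=7):
    """Decide whether an instance looks idle.

    `datapoints` is the Datapoints list returned by CloudWatch's
    get-metric-statistics with a period of 86400 s (one entry per day).
    The instance is flagged only if every daily CPUUtilization average
    over at least `min_days` days sits below the threshold.
    """
    if len(datapoints) < min_days:
        return False  # not enough history to judge
    return all(dp["Average"] < cpu_threshold for dp in datapoints)
```

Keeping the rule pure makes it easy to unit-test before wiring it to an auto-shutdown action.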

3️⃣ Lifecycle Policies (Periodic Cleanup)

Goal: Regularly clean stale data, logs, snapshots, and temp resources.

Tooling:

  • S3 Lifecycle Policy — auto-transition/delete old objects.
  • EBS Snapshot Lifecycle Manager (DLM) — auto-delete expired snapshots.
  • Terraform Lifecycle Rules — control destroy/keep logic.
  • Scheduled Lambda cleanup — remove untagged temp resources (e.g., “test”, “tmp”).

Example (S3 lifecycle policy):

{
  "Rules": [
    {
      "ID": "DeleteOldLogs",
      "Prefix": "logs/",
      "Status": "Enabled",
      "Expiration": {"Days": 30}
    }
  ]
}

Past practice

In a previous company, even with ample resources available, I still requested them conservatively, following:

  • Apply on Demand: request resources only when truly needed to avoid idling;
  • Reuse Before Apply: prefer reusing existing environments/instances;
  • Utilization-informed strategy: summarize utilization from the last deployment to guide the next request.

5. Security & Compliance

5.1 IAM and Least-Privilege Access

5.1.1 IAM (Identity and Access Management)

IAM (Identity and Access Management) is AWS’s identity and access control system. Its core goal is:

“Ensure the right people and services, at the right time, can access only the right resources.”

In other words, IAM determines:

  • Who (Role) can access;
  • What (Resource) can be accessed;
  • Under what conditions (Condition);
  • Which actions (Action) are permitted.

5.1.2 Policy Example

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::temporary-upload-bucket/*",
      "Condition": {
        "DateLessThan": {
          "aws:CurrentTime": "2025-10-13T00:00:00Z"
        }
      }
    }
  ]
}

5.1.3 RBAC Model & Least Privilege

AWS IAM follows a role-based access control (RBAC) model, where each policy must be attached to an IAM entity such as an IAM account, IAM user group, IAM user, or IAM role.
An IAM Role is a temporary identity that can be assumed by AWS services, users, or external systems to obtain specific permissions for a limited time.
The Least Privilege Principle ensures that each IAM entity has only the minimal permissions required to perform its tasks.

5.1.4 Policy Layers

| Policy Type | Scope | Typical Scenarios | Characteristics |
|---|---|---|---|
| Account Policy (global) | Affects all users/resources under an AWS account or organization | Unified security baseline / global restrictions (e.g., disallow deleting CloudTrail or disabling billing monitoring) | Implemented via Service Control Policies (SCPs) in AWS Organizations by admins |
| Group Policy (team-level) | Applied to a user group; uniform permissions for members | Grant the Dev group "S3 read/write + EC2 start/stop"; the Finance group "view billing" | Facilitates team-wide authorization; members inherit automatically |
| Role Policy (service-level) | Granted to services (Lambda/EC2/EKS Pod/CI-CD tools) or for cross-account access | Lambda accessing DynamoDB, EC2 pushing logs to S3, GitHub Actions deploying to AWS | For machine identities / temporary access (AssumeRole); follow least privilege |
| Inline Policy (exception-level) | Embedded directly in a specific user/group/role | Temporary emergency grants, short-term tests/special tasks (e.g., open S3 upload for one day) | Shares the lifecycle of the bound entity; removed on deletion; not reusable |

In one sentence: Account Policies control what the entire account can do; Group Policies control what groups of people can do; Role Policies control what a service/system can do; Inline Policies are for short-term exceptions or temporary grants.

5.1.5 ARN Components

ARN (Amazon Resource Name) is the unique identifier for an AWS resource.

| Part | Meaning | Example |
|---|---|---|
| arn | Fixed prefix indicating an ARN | arn |
| partition | AWS partition (usually aws) | aws (global) / aws-cn (China) / aws-us-gov (GovCloud) |
| service | Service name | s3, ec2, iam, lambda, dynamodb |
| region | Region code | ap-southeast-2 (Sydney) |
| account-id | AWS account ID (12 digits) | 123456789012 |
| resource | Specific resource name/path | bucket-name, instance/i-12345, role/MyLambdaRole |

5.1.6 ARN Examples

| Service | Example ARN | Description |
|---|---|---|
| S3 Bucket | arn:aws:s3:::my-data-bucket | Entire bucket |
| S3 Object | arn:aws:s3:::my-data-bucket/images/photo.jpg | Specific object in bucket |
| EC2 Instance | arn:aws:ec2:ap-southeast-2:123456789012:instance/i-0abc1234def567890 | A VM instance |
| Lambda Function | arn:aws:lambda:ap-southeast-2:123456789012:function:MyFunction | A function |
| IAM Role | arn:aws:iam::123456789012:role/MyLambdaRole | An IAM role |
| DynamoDB Table | arn:aws:dynamodb:ap-southeast-2:123456789012:table/Users | A table |
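Because the first five fields are ':'-separated, ARNs are easy to take apart programmatically. A small parser sketch (illustrative, not an AWS SDK API):

```python
def parse_arn(arn):
    """Split an ARN into its documented parts.

    Only the first five ':' separators delimit fields; everything after the
    fifth is the resource part, which may itself contain ':' or '/'
    (e.g. 'function:MyFunction').
    """
    prefix, partition, service, region, account_id, resource = arn.split(":", 5)
    if prefix != "arn":
        raise ValueError(f"not an ARN: {arn}")
    return {
        "partition": partition,
        "service": service,
        "region": region,          # empty for global services (e.g., IAM, S3)
        "account_id": account_id,  # empty where AWS omits it (e.g., S3)
        "resource": resource,
    }
```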

5.1.7 ARN in Policies

In the Resource field of an IAM policy, the ARN specifies the target resource to allow/deny.

{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::my-data-bucket/*"
}

Meaning: allow reading all objects under the my-data-bucket S3 bucket.

5.1.8 Wildcards in ARNs

Sometimes you need to match a class of resources using *:

"Resource": "arn:aws:lambda:ap-southeast-2:123456789012:function:*"

This means: allow access to all Lambda functions in the current account within that region.
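A simplified sketch of how such a wildcard pattern matches a concrete ARN; real IAM evaluation has additional rules (policy variables, explicit denies, etc.), but for Resource patterns `*` matches any run of characters, which glob-style matching reproduces:

```python
import fnmatch

def resource_matches(pattern, arn):
    """Simplified check that a policy Resource pattern covers an ARN.

    Uses case-sensitive glob matching: '*' matches any characters,
    including ':' and '/'. This is an approximation of IAM's behavior,
    not a full policy evaluator.
    """
    return fnmatch.fnmatchcase(arn, pattern)
```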

5.1.9 Policy Reviews

  • ✅ Use IAM Access Analyzer to detect over-privileged permissions
  • ✅ Use CloudTrail to audit all IAM change operations

5.1.10 Past Practice

  • In previous internships, the initial approach was direct access from a local program to AWS Bedrock, which made approval difficult.
  • Later, calling Bedrock via AWS Lambda reduced permission requirements and achieved the goal.

5.2 Secrets Management

5.2.1 Management Method

AWS Secrets Manager + AWS KMS (Key Management Service)

5.2.2 Secret Retrieval Flow

sequenceDiagram
    participant App as Application
    participant SM as AWS Secrets Manager
    participant KMS as AWS KMS
    participant DB as Database
    App->>SM: Request secret (GetSecretValue)
    SM->>KMS: Decrypt stored secret
    KMS-->>SM: Return plaintext credentials
    SM-->>App: Return credentials (JSON)
    App->>DB: Use credentials to establish connection

Note: Secrets Manager can enable automatic rotation, generating a Rotation Lambda Function to periodically create new passwords.

5.2.3 Secrets Manager Storage Structure (Example)

{
  "username": "admin_user",
  "password": "N9pLxkK4Q7z!",
  "engine": "mysql",
  "host": "db-prod.cluster-xxxx.ap-southeast-2.rds.amazonaws.com",
  "port": 3306,
  "dbname": "production"
}

5.2.4 Access Control for Secrets (IAM + ARN)

Each secret has a unique ARN, for example:
arn:aws:secretsmanager:ap-southeast-2:123456789012:secret:prod/db-password-AbCdE

To access that db-password, a user or service must have a policy like:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "secretsmanager:GetSecretValue"
    ],
    "Resource": "arn:aws:secretsmanager:ap-southeast-2:123456789012:secret:prod/db-password-AbCdE"
  }]
}

5.2.5 Past Practice

  • GitHub Secrets;
  • Use of self-hosted enterprise-grade KMS.

5.3 Encryption in Transit / At Rest

5.3.1 Past Practice

  • Frontend usernames/passwords MD5-hashed before transmission; Flask added a salt; the resulting digest was stored in the DB;
  • TLS/SSL (HTTPS) to encrypt site access;
  • Encrypted protocols between services (e.g., Remote Desktop, vmess, ssr);
  • Application layer: VPN.

5.4 Network Controls

5.4.1 Past Practice

  • Subnets: initially designed subnets + bastion for the alcohol auditor;
  • Security Groups: when using OCI, configured accessible IPs and traffic ports;
  • Linux firewall: for personal sites, only opened necessary service ports; others closed;
  • Microservice traffic control: used Spring Cloud Gateway and Alibaba Sentinel for rate limiting.

5.5 Vulnerability Remediation

5.5.1 Past Practice

  • Package scanning + manual review;
  • Monitor official security bulletins (e.g., CNNVD), rate risks, and report;
  • Prepare upgrade plans: set explicit upgrade dates and email stakeholders.

5.6 Audit Support

5.6.1 Past Practice

  • At ICBC: recorded vulnerability remediation, patch updates, and risk closure;
  • Submitted infrastructure & configuration change history (CSV files) for audit.