My Understanding of the Platform Engineer Role
Platform Engineering — Quick Notes
In my opinion, a Platform Engineer has three major responsibilities:
- Automation (CI/CD Workflow): basic application automation, infrastructure automation, data pipeline automation, AI deployment automation
- Platform Operation Stability: monitoring and alerting, quality control (SLO), rollback strategy, cost hygiene.
- Access Control, Security and Compliance: Least Privilege IAM practice, Secrets Management, Encryption, Network Access Control, Audit Support, Vulnerability Remediation.
This document reorganizes the notes into structured sections for clarity and readability.
Note:
I am already familiar with basic application automation and infrastructure automation.
I was less familiar with data pipeline automation, but I studied it over the weekends.
Correct me if I'm wrong!
1. Basic Application Automation — Basic App CI/CD Workflow
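As a hedged sketch of what this section covers, a basic app CI/CD workflow in GitHub Actions might look like the following (the test command, image name, and deploy script are assumptions, not from the original notes):

```yaml
name: app-ci-cd
on:
  push:
    branches: [main]
jobs:
  build-test-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: make test                                # assumed test entry point
      - name: Build container image
        run: docker build -t my-app:${{ github.sha }} .
      - name: Deploy
        run: ./deploy.sh                              # assumed deploy script (push image + rollout)
```

The point is the shape of the pipeline (test → build → deploy), not the specific commands.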
2. Infrastructure Automation
2.1 Automation Approaches
There are two ways to implement infrastructure automation: a visual drag-and-drop orchestration approach (e.g., AWS Step Functions, Azure Logic Apps) and an Infrastructure as Code (IaC) approach using Terraform.
In general, during the exploratory development phase, you can prioritize the drag-and-drop approach to quickly assemble, validate logic, and adjust dependencies; when workflows and architecture have stabilized, use Terraform for code-based management.
Before doing infrastructure automation, you need a basic understanding of commonly used AWS services and their attributes. For example:
- When creating an EC2 instance, pay attention to attributes such as subnet, instance type (CPU/memory), security group, and key pair.
- When creating an S3 bucket, focus on access control (ACL/Policy), encryption, versioning, and lifecycle rules.
In addition, whatever the resource (EC2, S3, or others), you must configure permissions (for access control) and tagging (for cost hygiene) at creation time. This is discussed in detail in later chapters.
2.2 AWS Service Attribute Quick Reference (Updated)
| Layer | Service | Main Role | Description / Dependency Logic | Frequently Adjusted Attributes |
|---|---|---|---|---|
| 1. Network Foundation Layer | VPC (Virtual Private Cloud) | Define network boundaries (CIDR, DNS, subnets) | Foundation for all resources; must be created first | cidr_block, enable_dns_hostnames, enable_dns_support, tags |
| | Subnet | Network segmentation (public/private) | Depends on VPC; determines whether resources can access the public Internet | vpc_id, cidr_block, availability_zone, map_public_ip_on_launch, tags |
| | Security Group | Network access control | Depends on VPC; controls EC2, RDS, ECS traffic | ingress, egress, description, tags |
| | Route 53 | DNS resolution | Can connect to CloudFront, ALB, or API Gateway | records, ttl, alias, health_check_id |
| 2. Storage & Encryption Layer | S3 (Simple Storage Service) | Object storage | Can be used for static websites, logs, backup sources | acl, versioning, lifecycle_rule, public_access_block, server_side_encryption_configuration, tags |
| | EBS (Elastic Block Store) | EC2 block storage | Attached to EC2 instances | size, type, iops, encrypted, kms_key_id, tags |
| | EFS (Elastic File System) | Shared file system | Mountable to multiple EC2 / ECS | performance_mode, throughput_mode, lifecycle_policy, encrypted, mount_targets |
| | KMS (Key Management Service) | Key management & encryption | Used by S3 / EBS / RDS / CloudTrail | enable_key_rotation, policy, deletion_window_in_days, tags |
| 3. Identity & Access Layer | IAM Role / Policy | Permissions & access control | Required by all services (EC2, Lambda, Terraform) | assume_role_policy, inline_policies, managed_policy_arns, tags |
| 4. Compute & Container Layer | EC2 (Elastic Compute Cloud) | Virtual machine instances | Depends on VPC, Subnet, SG, IAM, EBS | ami, instance_type, key_name, subnet_id, vpc_security_group_ids, user_data, tags |
| | ECS / EKS | Container orchestration | Depends on EC2 / Fargate, VPC, SG, IAM | cluster_name, task_definition, service, network_configuration, execution_role_arn |
| | Lambda | Serverless functions | Depends on IAM Role, VPC (optional) | runtime, handler, timeout, memory_size, environment, vpc_config, source_code_hash, tags |
| 5. Data & Messaging Layer | RDS (Relational Database Service) | Managed database | Depends on VPC, Subnet, SG, KMS (encryption) | engine_version, instance_class, storage_type, multi_az, backup_retention_period, vpc_security_group_ids, tags |
| | DynamoDB (NoSQL Database) | Key-value / document database | Serverless NoSQL; often used by Lambda, Step Functions, Terraform backend locking | hash_key, range_key, billing_mode, read_capacity, write_capacity, stream_enabled, ttl, tags |
| | SQS (Simple Queue Service) | Queue-based communication | Asynchronous decoupling for Lambda / ECS | delay_seconds, message_retention_seconds, visibility_timeout_seconds, fifo_queue, tags |
| | SNS (Simple Notification Service) | Message notifications | Works with CloudWatch / Lambda / SQS | topic_name, subscription, delivery_policy, tags |
| 6. Network & Distribution Layer | Load Balancer (ALB/NLB) | Traffic distribution | Depends on Subnet, SG; connects to EC2 / ECS | listener, target_group, health_check, subnets, security_groups, tags |
| | CloudFront (CDN) | Global content distribution | Depends on S3 / ALB / ACM certificates | origin, default_cache_behavior, viewer_certificate, aliases, enabled, tags |
| 7. Monitoring & Logging Layer | CloudWatch | Monitoring & alerting | Depends on SNS; collects logs from Lambda / ECS / EC2 | metric_alarm, threshold, retention_in_days, comparison_operator, sns_topic_arn |
| | CloudTrail | Audit logs | Depends on S3, KMS; tracks API operations | is_multi_region_trail, s3_bucket_name, enable_log_file_validation, cloud_watch_logs_group_arn, tags |
| 8. IaC & Automation Layer | CloudFormation / Terraform Backend | Infrastructure as Code (IaC) | Depends on S3 (state storage) + DynamoDB (locking) | bucket, dynamodb_table, region, encrypt, kms_key_id |
| 9. Image & Artifact Layer | ECR (Elastic Container Registry) | Image registry | Used by ECS / EKS to pull images | image_scanning_configuration, encryption_configuration, lifecycle_policy, repository_name, tags |
2.3 Terraform Basics
2.3.1 Overview
Terraform is an Infrastructure as Code (IaC) tool that acts as a translator between your configuration files (written in HashiCorp Configuration Language, HCL) and the APIs of various cloud providers (like AWS, Azure, GCP, OCI). Terraform configurations are made up of several key components:
| Concept | Description |
|---|---|
| Provider | A plugin provided by cloud vendors (e.g., AWS, OCI, Azure) that allows Terraform to communicate with their APIs — essentially the translator between Terraform and the cloud. |
| Resource | The actual cloud objects defined in Terraform, such as compute instances, storage buckets, or databases. |
| Output | Used to display or pass results after Terraform execution — for example, the public IP address of an instance or a database endpoint. |
2.3.2 Example: Resource Definition (OCI Instance)
```hcl
resource "oci_core_instance" "arm_instance" {
  availability_domain = var.availability_domain
  # Free-tier users usually have only the root compartment,
  # whose ID equals the tenancy OCID
  compartment_id = var.tenancy_ocid
  shape          = var.instance_shape

  shape_config {
    ocpus         = var.ocpus
    memory_in_gbs = var.memory_in_gbs
  }

  create_vnic_details {
    subnet_id        = var.subnet_id
    assign_public_ip = true
  }

  source_details {
    source_type             = var.instance_source_type
    source_id               = var.source_id
    boot_volume_size_in_gbs = var.boot_volume_size_in_gbs
  }

  metadata = {
    ssh_authorized_keys = var.ssh_authorized_keys
  }
}
```
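To round out the Output concept from the table above, an output for this instance might look like the following sketch (`public_ip` is the OCI provider's documented attribute for `oci_core_instance`):

```hcl
output "instance_public_ip" {
  description = "Public IP of the ARM instance"
  value       = oci_core_instance.arm_instance.public_ip
}
```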
2.3.3 Terraform Execution Workflow
A typical Terraform workflow consists of the following steps:
```text
terraform init
  -> terraform plan
  -> terraform apply
       ├─ ① Retrieve current state (from local or backend)
       ├─ ② Lock the state (state locking)
       ├─ ③ Apply changes (create / update / destroy)
       ├─ ④ Write updated state file
       └─ ⑤ Unlock state
```
In collaborative or production environments, it is strongly recommended to store Terraform state in a remote backend with state locking enabled. AWS officially recommends the following combination:
- S3 backend → stores the Terraform state file (state)
- DynamoDB → provides locking to prevent concurrent modifications
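A minimal sketch of that recommended backend configuration (bucket and table names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"       # placeholder bucket name
    key            = "envs/prod/terraform.tfstate"
    region         = "ap-southeast-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"          # placeholder lock table
  }
}
```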
2.3.4 Terraform Modularization
```text
root/
├── main.tf          # Root module (invokes submodules)
├── variables.tf
├── outputs.tf
└── modules/
    └── ec2_instance/        # Submodule
        ├── main.tf
        ├── variables.tf
        └── outputs.tf
```
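The root main.tf would invoke the submodule roughly like this (the variable names passed through are assumptions):

```hcl
module "ec2_instance" {
  source        = "./modules/ec2_instance"
  instance_type = var.instance_type
  subnet_id     = var.subnet_id
  tags          = var.tags
}
```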
2.3.5 EKS Cluster and Infrastructure Automation
The creation of an EKS cluster is also part of infrastructure automation. In an EKS environment, Terraform is typically responsible for:
- Setting up the network environment for EKS (VPC, subnets, security groups);
- Creating Node Group instances for running Pods;
- Managing IAM roles and policies;
- Configuring monitoring components (e.g., CloudWatch).
After deployment, what you get is essentially an empty Kubernetes cluster, like a bare operating system: Terraform provisions the environment, but it does not control what runs inside the cluster.
Kubernetes, on the other hand, uses YAML manifests to define what runs within the cluster:
- Which application (container image) to run
- How many replicas to deploy
- Which ports to expose and monitor
- Which configuration or storage resources to attach
These application-level details are outside Terraform’s scope.
Note: Terraform can also deploy containerized apps indirectly by invoking the Kubernetes API (similar to `kubectl apply`), but for clarity and maintainability, avoid using Terraform for application deployment during the infrastructure initialization stage.
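To contrast with Terraform's scope, a minimal Kubernetes Deployment manifest covering the points above (image, port, and names are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3                               # how many replicas to deploy
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:1.0.0   # which container image to run
          ports:
            - containerPort: 8080           # which port to expose
```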
3. Data Pipeline Automation
Why data pipeline automation is needed: it makes "data supply" as repeatable, observable, and reversible as application delivery, using engineering practices to deliver reliable data to BI/AI applications stably, quickly, and at low cost.
- What is Snowflake: originally focused on data warehouse, SQL, Python support; SaaS.
- What is Databricks: originally focused on data lake and ML; PaaS.
- Snowflake and Databricks: both can be regarded as data warehouses in this context.
- What is dbt: a framework that sits on top of the data warehouse. It turns warehouse data into usable data products via models, tests, and environments. Its run/test commands integrate well with CI/CD. Orchestrators (e.g., Airflow) or global CI/CD pipelines call `dbt run` to transform models into cleaned, modeled tables/views, and `dbt test` to enforce data quality.
3.1 dbt Models
Each dbt model (.sql file) is essentially executable SQL + a Jinja templating layer. A model defines the full mapping from upstream sources → transformation logic → target tables/views.
Common model types:
| Type | Description | Examples |
|---|---|---|
| Staging Models | Light cleaning and standardization of raw data | Remove duplicates, rename fields |
| Intermediate Models | Aggregate and join different tables | joins, aggregation logic |
| Mart Models | Final result layer for analytics/business | User reports, sales KPIs |
Example:
```sql
-- models/marts/sales_summary.sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    SUM(amount)     AS total_sales,
    COUNT(order_id) AS order_count
FROM {{ ref('stg_orders') }}
GROUP BY 1
```
3.2 dbt Tests
dbt supports two testing mechanisms: Generic (built-in) tests and Custom tests.
Common test types:
| Test Type | Purpose | Example |
|---|---|---|
| `unique` | Validate field uniqueness | Primary key duplication check |
| `not_null` | Check non-null constraints | customer_id cannot be null |
| `accepted_values` | Validate value domain | status ∈ {active, inactive} |
| `relationships` | Validate foreign keys | Orders → Users |
Example:
```yaml
version: 2
models:
  - name: sales_summary
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
```

(In dbt's schema.yml, generic tests are declared under the column they apply to, rather than taking a column_name argument at the model level.)
3.3 dbt Environments
dbt uses profiles.yml to manage different deployment environments (e.g., dev, staging, prod). Each environment defines connection parameters, credentials, schema, etc., to achieve environment isolation and continuous delivery.
Typical structure:
```yaml
# ~/.dbt/profiles.yml
my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: abc123.ap-southeast-2
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: DEVELOPER
      database: ANALYTICS_DEV
      warehouse: COMPUTE_WH
      schema: DEV_SCHEMA
    prod:
      type: snowflake
      account: abc123.ap-southeast-2
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: PROD_ROLE
      database: ANALYTICS_PROD
      warehouse: COMPUTE_WH
      schema: PROD_SCHEMA
```
In CI/CD, you can switch the target environment dynamically via environment variables, e.g.:
dbt run --target prod
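In GitHub Actions this could look like the following step (the `DBT_TARGET` variable and secret name are assumptions):

```yaml
- name: Run dbt
  env:
    DBT_TARGET: prod
    DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}
  run: |
    dbt run --target "$DBT_TARGET"
    dbt test --target "$DBT_TARGET"
```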
3.4 Modern Data Platform Layer Model
```text
┌───────────────────────────────┐
│ 1️⃣ Data Ingestion Layer       │
│ Source → Collect raw data     │
│ 📥 Output: Raw Data           │
│ Tools: APIs, Scripts          │
└──────────────┬────────────────┘
               ▼
┌───────────────────────────────┐
│ 2️⃣ Data Storage Layer         │
│ Centralize & store data       │
│ 🗄️ Output: Structured /       │
│    Semi-structured zone       │
│ Tools: S3, Snowflake,         │
│        Databricks SQL         │
└──────────────┬────────────────┘
               ▼
┌───────────────────────────────┐
│ 3️⃣ Data Processing Layer      │
│ Clean, transform, standardize │
│ ⚙️ Output: Cleaned Tables     │
│ Tools: dbt                    │
└──────────────┬────────────────┘
               ▼
┌───────────────────────────────┐
│ 4️⃣ Data Orchestration Layer   │
│ (optional)                    │
│ Manage ETL/ELT workflows      │
│ Automate dependencies & CI/CD │
│ 🧠 Output: Reliable pipelines │
│ Tools: Airflow                │
└──────────────┬────────────────┘
               ▼
┌───────────────────────────────┐
│ 5️⃣ Data Consumption Layer     │
│ Make data "usable"            │
│ 📊 Output: Reports, Dashboards│
│ Tools: Power BI, Tableau      │
└───────────────────────────────┘

─────────────────────────────────────────────
⚙️ Data Governance & Security Layer
Data governance & security — spans the entire data lifecycle
─────────────────────────────────────────────
Responsibilities:
• Metadata management (Data Lineage / Data Catalog)
• Data access & permission control (Access Control)
• Data quality monitoring (Data Quality)
• Compliance & audit (GDPR / HIPAA / ISO)
Tools:
• DataHub, Collibra, Alation, Amundsen
• Monte Carlo, AWS Glue Data Catalog
Output:
🔒 Trusted, Compliant, and Secure Data
─────────────────────────────────────────────
```
3.5 Typical Data Pipeline Directory and Execution Order
A typical data platform directory structure:
```text
data-platform/
├─ infra/                  # Terraform: VPC, EKS, Databricks, Snowflake warehouse; IAM & secrets; runtime for later modules
├─ pipelines/              # Orchestrator scripts (Airflow DAG / Dagster job) for scheduling
│  ├─ dags/
│  └─ requirements.txt
├─ dbt/                    # dbt module (runs models and data tests; failures block release)
│  ├─ models/              # SQL models (reusable SQL modules)
│  ├─ tests/               # Data quality tests
│  └─ profiles_template.yml  # Render different env vars (dev/prod) in CI
├─ jobs/                   # Spark/Databricks/PySpark scripts (data engineering logic)
├─ ops/                    # SLO quality checks and ops scripts
├─ deploy/                 # Containerized deploy scripts for dbt-runner, Airflow, Databricks/Snowflake
└─ .github/workflows/      # Global CI/CD pipelines
```
One-sentence summary: infra lays the foundation, deploy deploys services, pipelines schedules, jobs compute, dbt models & validates, ops ensures stable running — all chained by global .github/workflows.
3.6 MLOps (Machine Learning Operations)
Why here? MLOps bridges the data platform (Section 3) and platform operations (Section 4) by enabling reliable, observable, and secure model delivery to production.
3.6.1 Overview
MLOps extends DevOps practices to machine learning systems so models can be continuously trained, deployed, monitored, and improved in production.
It integrates three domains:
| Domain | Responsibility | Example Tools |
|---|---|---|
| ML Development (Data Science) | Model design, training, experimentation | Jupyter, PyTorch, TensorFlow, Hugging Face |
| Data Engineering | Data ingestion, feature pipelines | Airflow, dbt, Spark, Databricks |
| Platform Engineering / DevOps | CI/CD, infrastructure, observability | GitHub Actions, Kubernetes, MLflow, Seldon Core |
3.6.2 MLOps Lifecycle
Data & Feature Engineering
- Reproducible data pipelines; Feature Store (Feast/Databricks).
- Data versioning via DVC/LakeFS; schema contracts to avoid breaking changes.
Model Training & Experiment Tracking
- Track experiments with MLflow or Weights & Biases.
- Store artifacts centrally (e.g., S3 + MLflow backend).
- Automate training via scheduled DAGs or CI jobs.
Model Registry
- Register model versions with metadata (owner, metrics, timestamp).
- Examples: MLflow Registry, SageMaker Model Registry.
Model Deployment
- Batch inference: Spark/Databricks jobs, AWS Batch.
- Online inference: FastAPI on EKS, Lambda, SageMaker Endpoint, Seldon Core.
- Use canary / shadow deploys for safe rollout.
Model Monitoring
- Track prediction/concept drift, latency, and error rates.
- Emit metrics to CloudWatch / Prometheus, traces/logs to Dynatrace.
Continuous Retraining
- Trigger retraining on drift thresholds or SLO breaches.
- Push new model to registry → gated deploy with approvals.
3.6.3 MLOps CI/CD Example
Typical steps:
1️⃣ Git Commit → code/data versioning (Git + DVC)
2️⃣ CI → unit tests, data validation, training
3️⃣ Store artifacts → MLflow / S3
4️⃣ CD → deploy serving (API/Batch)
5️⃣ Monitor → metrics, drift detection
6️⃣ Feedback → retrain on trigger
Minimal GitHub Actions sketch:
```yaml
name: mlops-ci-cd
on: [push]
jobs:
  build-train-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Train model
        run: python train.py
      - name: Register model
        # mlflow has no `register` CLI subcommand; registration is done in
        # Python via mlflow.register_model(), e.g. inside this script
        run: python register_model.py
      - name: Deploy model
        run: python deploy.py --env prod
```
3.6.4 Integration with Platform Engineering
| Platform Component | MLOps Integration |
|---|---|
| Infrastructure Automation | Provision GPU nodes, S3 buckets, SageMaker/Databricks via Terraform |
| Data Pipeline Automation | Supply versioned, validated datasets for model training |
| Operations & Stability | Apply SLOs to ML APIs and batch jobs; monitor accuracy & drift |
| Security & Compliance | Encrypt model artifacts, restrict IAM to registries, ensure GDPR/PII compliance |
4. Platform Operations and Stability
4.1 Concepts: SLO and Observability
SLO (Service Level Objective) represents the target performance or availability expected of a service — a core metric for platform stability.
Common dimensions:
| Metric Category | Example Metric | Example Target |
|---|---|---|
| Availability | API success rate | ≥ 99.9% |
| Performance | P95 latency | < 500 ms |
| Correctness | Job success rate | ≥ 98% |
| Cost Efficiency | Resource utilization | ≥ 80% |
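To make an availability target tangible, here is a quick error-budget calculation (a standard SRE back-of-envelope, sketched in Python; not from the original notes):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

# A 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))  # → 43.2
```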
Observability refers to the system’s ability to be observed, understood, and diagnosed. It includes three core pillars:
- Metrics: e.g., latency, CPU usage, throughput
- Logs: event records from systems and applications
- Traces: cross-service request tracing (distributed call analysis)
4.2 Monitoring & Alerting — Tools and Usage
When a deployment fails or SLOs are not met, trigger alerts. For example:
Condition:
```text
IF error_rate > 0.5% OR latency_p95 > 800ms
THEN trigger alarm → SNS topic → Slack channel #platform-alerts
```
With CloudWatch Alarm Actions, you can invoke Lambda to automatically:
- Perform rollbacks;
- Restart containers;
- Notify on-call engineers;
- File an incident ticket.
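The latency half of the condition above could be encoded as CloudWatch alarm parameters. This sketch only builds the request dictionary (alarm name, namespace, and SNS topic ARN are placeholders); in practice you would pass it to boto3's `put_metric_alarm`:

```python
# Parameters for a p95-latency alarm matching the condition above.
# In practice: boto3.client("cloudwatch").put_metric_alarm(**alarm)
alarm = {
    "AlarmName": "api-p95-latency-high",       # placeholder name
    "Namespace": "AWS/ApplicationELB",
    "MetricName": "TargetResponseTime",
    "ExtendedStatistic": "p95",
    "Period": 60,
    "EvaluationPeriods": 5,                    # 5 consecutive one-minute periods
    "Threshold": 0.8,                          # 800 ms, in seconds
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:ap-southeast-2:123456789012:platform-alerts"],
}
print(alarm["ComparisonOperator"])  # → GreaterThanThreshold
```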
Besides AWS CloudWatch, you can use tools like Dynatrace for deeper operational data analysis and incident investigation.
4.3 Incident Management & Optimization
| Goal | Key Practices |
|---|---|
| Detection coverage | Ensure probes exist across Infra / App / Data pipeline layers |
| False-positive control | Use multiple metrics + time windows (e.g., continuous 5 minutes) |
| Automated response | CloudWatch Alarm + Lambda + Runbook |
| Incident tracking | Auto-generate tickets / Jira records |
| Root Cause Analysis (RCA) | Use Dynatrace to auto-correlate logs and traces |
For releases, you can use canary release with partial rollbacks as needed.
4.4 Cloud Cost Governance (Cost Hygiene Principle)
- Tagging: tag infra resources to track cost ownership;
- Rightsizing: size resources appropriately to avoid waste;
- Lifecycle Policies: clean up by lifecycle to prevent accumulation.
Concrete actions:
1️⃣ Tagging — Automated Detection
Goal: ensure all EC2, S3, EBS, RDS, Lambda, etc., carry the required tags: Project, Environment, Owner, CostCenter.
Tooling:
- AWS Config Rules — detect missing tags.
- AWS Tag Editor — central view and bulk tagging.
- Custom Lambda scripts — auto-detect and notify owners.
Example (Python + boto3):
```python
import boto3

ec2 = boto3.client('ec2')
instances = ec2.describe_instances()['Reservations']
for r in instances:
    for i in r['Instances']:
        tags = {t['Key']: t['Value'] for t in i.get('Tags', [])}
        if 'Project' not in tags or 'Owner' not in tags:
            print(f"[WARN] Instance {i['InstanceId']} missing tags: {tags}")

# Can be run periodically, or as AWS Lambda + CloudWatch Events to scan daily.
```
2️⃣ Rightsizing (Resource Optimization)
Goal: identify over-provisioned or long-idle resources (e.g., EC2 CPU utilization < 10% for 7 consecutive days).
Tooling:
- AWS Compute Optimizer — official recommendation; analyzes EC2/EBS/Lambda loads and proposes adjustments.
- AWS Trusted Advisor — cost & security recommendations (includes idle resource checks).
- Custom CloudWatch Alarm — alert on low CPU/Mem thresholds.
Example (flag idle instances):
```bash
aws cloudwatch get-metric-statistics \
  --metric-name CPUUtilization \
  --start-time 2025-10-10T00:00:00Z \
  --end-time 2025-10-12T00:00:00Z \
  --period 86400 \
  --namespace AWS/EC2 \
  --statistics Average \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef
# → Feed the results to a Lambda that decides on auto-shutdown.
```
3️⃣ Lifecycle Policies (Periodic Cleanup)
Goal: regularly clean stale data, logs, snapshots, and temporary resources.
Tooling:
- S3 Lifecycle Policy — auto-transition/delete old objects.
- EBS Snapshot Lifecycle Manager (DLM) — auto-delete expired snapshots.
- Terraform Lifecycle Rules — control destroy/keep logic.
- Scheduled Lambda cleanup — remove untagged temp resources (e.g., “test”, “tmp”).
Example (S3 lifecycle policy):
```json
{
  "Rules": [
    {
      "ID": "DeleteOldLogs",
      "Prefix": "logs/",
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }
  ]
}
```
Past practice: In a previous company, even with ample resources, I still requested them conservatively, following:
- Apply on Demand: request resources only when truly needed to avoid idling;
- Reuse Before Apply: prefer reusing existing environments/instances;
- Utilization-informed strategy: summarize utilization from the last deployment to guide the next request.
5. Security & Compliance
5.1 IAM and Least-Privilege Access
5.1.1 IAM (Identity and Access Management)
IAM (Identity and Access Management) is AWS’s identity and access control system. Its core goal is:
“Ensure the right people and services, at the right time, can access only the right resources.”
In other words, IAM determines:
- Who (Role) can access;
- What (Resource) can be accessed;
- Under what conditions (Condition);
- Which actions (Action) are permitted.
5.1.2 Policy Example
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::temporary-upload-bucket/*",
      "Condition": {
        "DateLessThan": {
          "aws:CurrentTime": "2025-10-13T00:00:00Z"
        }
      }
    }
  ]
}
```
5.1.3 RBAC Model & Least Privilege
AWS IAM follows a role-based access control (RBAC) model, where each policy must be attached to an IAM identity such as a user, user group, or role.
An IAM Role is a temporary identity that can be assumed by AWS services, users, or external systems to obtain specific permissions for a limited time.
The Least Privilege Principle ensures that each IAM entity has only the minimal permissions required to perform its tasks.
5.1.4 Policy Layers
| Policy Type | Scope | Typical Scenarios | Characteristics |
|---|---|---|---|
| ✅ Account Policy (Global) | Affects all users/resources under an AWS account or organization | Unified security baseline / global restrictions (e.g., disallow deleting CloudTrail or disabling billing monitoring) | Implemented via Service Control Policies (SCP) in AWS Organizations by admins |
| ✅ Group Policy (Team-level) | Applied to a user group; uniform permissions for all members | Grant the Dev group "S3 read/write + EC2 start/stop"; the Finance group "view billing" | Facilitates team-wide authorization; members inherit automatically |
| ✅ Role Policy (Service-level) | Granted to services (Lambda / EC2 / EKS Pod / CI-CD tools) or for cross-account access | Lambda accessing DynamoDB, EC2 pushing logs to S3, GitHub Actions deploying to AWS | For machine identities / temporary access (AssumeRole); follow least privilege |
| ✅ Inline Policy (Exception-level) | Embedded directly in a specific user/group/role | Temporary emergency grants, short-term tests or special tasks (e.g., open S3 upload for one day) | Shares the lifecycle of the bound entity; removed when it is deleted; not reusable |
In one sentence: Account Policies control what the entire account can do; Group Policies control what groups of people can do; Role Policies control what a service/system can do; Inline Policies are for short-term exceptions or temporary grants.
5.1.5 ARN Components
ARN (Amazon Resource Name) is the unique identifier for an AWS resource.
| Part | Meaning | Example |
|---|---|---|
| `arn` | Fixed prefix indicating an ARN | `arn` |
| `partition` | AWS partition (usually `aws`) | `aws` (global) / `aws-cn` (China) / `aws-us-gov` (GovCloud) |
| `service` | Service name | `s3`, `ec2`, `iam`, `lambda`, `dynamodb` |
| `region` | Region code | `ap-southeast-2` (Sydney) |
| `account-id` | AWS account ID (12 digits) | `123456789012` |
| `resource` | Specific resource name/path | `bucket-name`, `instance/i-12345`, `role/MyLambdaRole` |
5.1.6 ARN Examples
| Service | Example ARN | Description |
|---|---|---|
| S3 Bucket | `arn:aws:s3:::my-data-bucket` | Entire bucket |
| S3 Object | `arn:aws:s3:::my-data-bucket/images/photo.jpg` | Specific object in the bucket |
| EC2 Instance | `arn:aws:ec2:ap-southeast-2:123456789012:instance/i-0abc1234def567890` | A VM instance |
| Lambda Function | `arn:aws:lambda:ap-southeast-2:123456789012:function:MyFunction` | A function |
| IAM Role | `arn:aws:iam::123456789012:role/MyLambdaRole` | An IAM role |
| DynamoDB Table | `arn:aws:dynamodb:ap-southeast-2:123456789012:table/Users` | A table |
5.1.7 ARN in Policies
In the Resource field of an IAM policy, the ARN specifies the target resource to allow/deny.
```json
{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::my-data-bucket/*"
}
```
Meaning: allow reading all objects under the my-data-bucket S3 bucket.
5.1.8 Wildcards in ARNs
Sometimes you need to match a class of resources using *:
```json
"Resource": "arn:aws:lambda:ap-southeast-2:123456789012:function:*"
```
This means: allow access to all Lambda functions in the current account within that region.
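As a rough analogy for how the wildcard behaves (IAM evaluates patterns server-side; Python's `fnmatch` is only an illustration, not IAM's actual matcher):

```python
from fnmatch import fnmatch

pattern = "arn:aws:lambda:ap-southeast-2:123456789012:function:*"
arn = "arn:aws:lambda:ap-southeast-2:123456789012:function:MyFunction"

# The pattern matches any function in this account and region,
# but not a function in a different region.
print(fnmatch(arn, pattern))  # → True
```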
5.1.9 Policy Reviews
- ✅ Use IAM Access Analyzer to detect over-privileged permissions
- ✅ Use CloudTrail to audit all IAM change operations
5.1.10 Past Practice
- In previous internships, the initial approach was direct access from a local program to AWS Bedrock, which made approval difficult.
- Later, calling Bedrock via AWS Lambda reduced permission requirements and achieved the goal.
5.2 Secrets Management
5.2.1 Management Method
AWS Secrets Manager + AWS KMS (Key Management Service)
5.2.2 Secret Retrieval Flow
Note: Secrets Manager can enable automatic rotation, generating a Rotation Lambda Function to periodically create new passwords.
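A minimal retrieval sketch, assuming the secret stores a JSON document like the storage-structure example in this section (the secret name is a placeholder; `boto3` is imported lazily so the parsing helper runs without AWS access):

```python
import json

def parse_secret(secret_string: str) -> dict:
    """Parse the JSON document stored in a Secrets Manager secret."""
    return json.loads(secret_string)

def get_db_credentials(secret_id: str) -> dict:
    """Fetch and parse a secret. The caller needs an IAM policy granting
    secretsmanager:GetSecretValue on this secret's ARN."""
    import boto3  # lazy import: only needed when actually calling AWS
    client = boto3.client("secretsmanager")
    resp = client.get_secret_value(SecretId=secret_id)
    return parse_secret(resp["SecretString"])

# The parsing step works on the sample structure without touching AWS:
sample = '{"username": "admin_user", "port": 3306}'
print(parse_secret(sample)["port"])  # → 3306
```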
5.2.3 Secrets Manager Storage Structure (Example)
```json
{
  "username": "admin_user",
  "password": "N9pLxkK4Q7z!",
  "engine": "mysql",
  "host": "db-prod.cluster-xxxx.ap-southeast-2.rds.amazonaws.com",
  "port": 3306,
  "dbname": "production"
}
```
5.2.4 Access Control for Secrets (IAM + ARN)
Each secret has a unique ARN, for example:
arn:aws:secretsmanager:ap-southeast-2:123456789012:secret:prod/db-password-AbCdE
To access that db-password, a user or service must have a policy like:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:ap-southeast-2:123456789012:secret:prod/db-password-AbCdE"
    }
  ]
}
```
5.2.5 Past Practice
- GitHub Secrets;
- Use of self-hosted enterprise-grade KMS.
5.3 Encryption in Transit / At Rest
5.3.1 Past Practice
- Frontend hashed usernames/passwords with MD5 before transmission; Flask added a salt; the ciphertext was stored in the DB;
- TLS/SSL (HTTPS) to encrypt site access;
- Encrypted protocols between services (e.g., Remote Desktop, vmess, ssr);
- Application layer: VPN.
5.4 Network Controls
5.4.1 Past Practice
- Subnets: initially designed subnets plus a bastion host for the alcohol-audit project;
- Security Groups: when using OCI, configured accessible IPs and traffic ports;
- Linux firewall: for personal sites, only opened necessary service ports; others closed;
- Microservice traffic control: used Spring Cloud Gateway and Alibaba Sentinel for rate limiting.
5.5 Vulnerability Remediation
5.5.1 Past Practice
- Package scanning + manual review;
- Monitor official security bulletins (e.g., CNNVD), rate risks, and report;
- Prepare upgrade plans: set explicit upgrade dates and email stakeholders.
5.6 Audit Support
5.6.1 Past Practice
- At ICBC: recorded vulnerability remediation, patch updates, and risk closure;
- Submitted infrastructure & configuration change history (CSV files) for audit.