
Data Migration Using Spark Glue Job

Amarpreet Singh
4 min read · Jul 25, 2022


Recently I was working on migrating data from an RDS for PostgreSQL table in one AWS account to DynamoDB in another AWS account in the same region. Although the table was simple, with six columns and one primary key, it had about 10 million records.

I initially tried AWS DMS, but it was proving tricky to set up and I was on a time crunch. Using DMS for cross-account migration was also painful from a networking perspective: it requires non-overlapping CIDR blocks, and in this organization a lot of teams shared the same IP space.

After a few days of trial and error, I decided to go with a Spark job using AWS Glue.

Bingo! It took approx 35 minutes to migrate ~10 million records!

Solution Architecture of Data Migration

The architecture of cross-account RDS to DynamoDB migration

This migration starts with taking a snapshot of your RDS instance and exporting that snapshot to an S3 bucket. It is important to note that the exported snapshot is in Parquet format. Then create a cross-account DynamoDB role in account B, create an AWS Glue job in account A that assumes the cross-account DynamoDB role, and run the job.

Let’s look at each step in more detail.

Create an RDS snapshot:

Using the AWS CLI, run the following command with your RDS DB instance identifier and a snapshot name.

aws rds create-db-snapshot \
--db-instance-identifier mydbinstance \
--db-snapshot-identifier mydbsnapshot
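
Snapshot creation can take a while. Before moving on to the export, you can wait for the snapshot to become available, for example:

aws rds wait db-snapshot-available \
--db-snapshot-identifier mydbsnapshot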

Export the snapshot to S3:

Make sure the IAM role used for the export has permission to write to the S3 bucket, and provide the correct KMS key ID to encrypt the snapshot exported to S3.

aws rds start-export-task \
--export-task-identifier unique-identifier \
--source-arn snapshot-arn \
--s3-bucket-name mybucket \
--iam-role-arn role-arn \
--kms-key-id key-id-to-encrypt-snapshot
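
For reference, the export role must trust the export.rds.amazonaws.com service principal, and its permissions policy needs write access to the bucket. A minimal sketch (bucket name matching the command above) might look like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject*",
        "s3:GetObject*",
        "s3:DeleteObject*",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::mybucket",
        "arn:aws:s3:::mybucket/*"
      ]
    }
  ]
}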

Create a DynamoDB table:

Create a DynamoDB table in account B (where the data needs to be migrated) with the correct partition/sort key and read/write capacity mode.

aws dynamodb create-table \
--table-name TestTable \
--attribute-definitions AttributeName=employeeId,AttributeType=S \
--key-schema AttributeName=employeeId,KeyType=HASH \
--billing-mode PAY_PER_REQUEST

Cross-account role:

AWS Glue ETL jobs support both cross-region and cross-account access to DynamoDB tables. In account B, create a cross-account role with DynamoDB read/write access to specific resources.

In account A, create an IAM policy that allows assuming the cross-account role created above, and attach it to the group, role, or user you want to use to access DynamoDB.

{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "<DynamoDBCrossAccessRole's ARN>"
  }
}
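
For completeness, the cross-account role in account B also needs a trust policy that allows account A (or, more narrowly, the Glue job role in account A) to assume it. A minimal sketch with a placeholder account ID:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<Account-A-ID>:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}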

Create a Glue job in Account A:

AWS Glue provides a fully managed, serverless environment to run ETL jobs on large datasets. It helps you categorize, clean, enrich, and move your data reliably between various data sources. Glue runs your job on a serverless Spark platform, which automatically provisions the resources needed to run it. When creating a Glue job, it is important to create an IAM role for the job that can, for example, read from S3 and write to DynamoDB. Attach the cross-account assume-role policy above to the Glue job's IAM role so it can write to DynamoDB in account B.
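
As a rough sketch, besides the AWSGlueServiceRole managed policy and the assume-role policy above, the Glue job role needs read access to the bucket holding the exported snapshot (bucket name is a placeholder):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::mybucket",
        "arn:aws:s3:::mybucket/*"
      ]
    }
  ]
}

Because the export is encrypted, the role also needs kms:Decrypt on the KMS key used in the export step.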

AWS Glue gives you several options to create a job: you can write or upload your own Spark code, write a Python shell script, or create the job visually. A Spark ETL job has three sections:

Data Source - where your input data is stored, either in S3 or a Data Catalog table, in JSON, CSV, or Parquet format.

Glue Job data source

When you export the RDS snapshot to S3, it is written in Parquet format. In the Glue script, it is important to read the source data as Parquet (for example, connection_type "s3" with format "parquet" in create_dynamic_frame.from_options). Once all the Parquet files are read, create a new data frame with a single partition.
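
The original post showed this step as a screenshot. Here is a minimal sketch of what the data-source part of the script could look like, assuming the snapshot export lives under s3://mybucket/rds-export/ (a placeholder path):

import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the exported Parquet files from S3 (path is a placeholder)
source_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://mybucket/rds-export/"]},
    format="parquet",
)

# Collapse everything into a single partition, as described above
source_dyf = DynamicFrame.fromDF(
    source_dyf.toDF().coalesce(1), glueContext, "source_dyf"
)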

Transform - where you apply mappings that rename fields, change data types, or drop columns.

Glue job data transformation

In the above code snippet, I’m mapping the RDS columns (id, employeeID, first_name, last_name, salary, employed, and created_on) to the relevant DynamoDB attributes. If your RDS data contains null values, you can set default values in the transform section.
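
The transform was also a screenshot in the original post. A hedged sketch, continuing from the data-source snippet above (the target attribute names and the column types here are my assumptions, not necessarily what the original job used):

from awsglue.transforms import ApplyMapping

# Map RDS columns to DynamoDB attribute names and types
# (types and target names below are assumptions)
mapped_dyf = ApplyMapping.apply(
    frame=source_dyf,
    mappings=[
        ("id", "long", "id", "long"),
        ("employeeID", "string", "employeeId", "string"),
        ("first_name", "string", "firstName", "string"),
        ("last_name", "string", "lastName", "string"),
        ("salary", "long", "salary", "long"),
        ("employed", "boolean", "employed", "boolean"),
        ("created_on", "timestamp", "createdOn", "string"),
    ],
)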

Data Target - where the output data is stored.

Glue job data target

In the above code snippet, I’m writing the dynamic frame to DynamoDB. There are various connection types you can pick depending on your needs, as described here. It is important to put the cross-account role ARN created above in the dynamodb.sts.roleArn field of connection_options, which ensures your Glue job writes data to DynamoDB in account B.
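
The write step was a screenshot as well. A minimal sketch, continuing from the previous snippets and assuming the table name and region below (adjust them to your setup):

# Write to DynamoDB in account B by assuming the cross-account role
glueContext.write_dynamic_frame_from_options(
    frame=mapped_dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "TestTable",
        "dynamodb.region": "us-east-1",
        "dynamodb.throughput.write.percent": "1.0",
        "dynamodb.sts.roleArn": "<DynamoDBCrossAccessRole's ARN>",
    },
)

job.commit()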

Other details to consider…

Update your backend services for dual writes, so that they write to both RDS and DynamoDB; this ensures you don't lose any data while the Glue job is running. Once the Glue job has run successfully, test the migrated data in DynamoDB with functional and performance tests. Detailed functional testing makes sure your application works with the new database without issues, and performance testing makes sure database response times stay within an acceptable range. Having a robust automated testing framework is key to a successful data migration. AWS also publishes many other best practices that are worth considering if you're planning a database migration strategy.

I’d be interested in hearing about alternate solutions to this problem. Please let me know in the comments!

Thanks to Rishi Raj for providing insights on AWS DMS!


Amarpreet Singh

I'm a Solution Architect and engineering leader based in San Francisco, passionate about exploring new technologies and tackling interesting challenges.