DB Migration: 2 AWS Services Compared

While working on a partial migration of an Amazon Relational Database Service (RDS) for PostgreSQL database, I started looking into two approaches for my particular case. The task was to migrate only a handful of tables from a 2 TB Postgres database into another database.

The first approach uses Amazon Managed Workflows for Apache Airflow (MWAA) in combination with dblink; the second uses AWS Database Migration Service (AWS DMS). Both approaches had to be repeatable and run on a schedule to keep the tables up to date.


There were about 250 tables in the source database, but the goal was to copy only half a dozen.

MWAA with dblink

The gist of this approach is to link the source database to the target database using the dblink extension, with MWAA acting as the task scheduler.

To create and configure the dblink extension, I used the following SQL statements on the target database:

CREATE EXTENSION dblink;
CREATE SERVER server_source_remote FOREIGN DATA WRAPPER dblink_fdw OPTIONS (host 'source.example.com', dbname 'source', port '5432');
CREATE USER MAPPING FOR masteruser SERVER server_source_remote OPTIONS (user 'source', password '***');
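
Before building anything on top of this, it is worth confirming that the foreign server and user mapping actually work. A trivial query through dblink is enough for that (a quick sanity check I would add here, not part of the original setup):

-- Should return a single row with ok = 1 if the mapping and network path are good
SELECT ok
FROM dblink('server_source_remote', 'SELECT 1') AS t (ok integer);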

Before copying data between the two databases, I also had to manually create the table schema on the target by exporting it from the source database:

CREATE TABLE customers (
    id SERIAL,
    email character varying(255),
    deleted boolean,
    updated_at timestamp without time zone DEFAULT now() NOT NULL,
    soft_deleted_at timestamp with time zone
);


Once the schema has been created, the migration queries can be run:

SELECT dblink_connect('conn_db_link', 'server_source_remote');
INSERT INTO customers
SELECT * FROM dblink (
    'conn_db_link',
    '
    SELECT
        customers.id,
        customers.email,
        customers.deleted,
        customers.updated_at,
        customers.soft_deleted_at
    FROM customers
    WHERE customers.deleted = FALSE;
    '
) AS t (
    id integer,
    email character varying(255),
    deleted boolean,
    updated_at timestamp without time zone,
    soft_deleted_at timestamp with time zone
)
;
SELECT dblink_disconnect('conn_db_link');

To start the job in MWAA, we also need to go to “Admin >> Connections” and add a new connection of the Postgres type. We named ours customer_rds, since we were copying customer data from the source to the target database.

Setting up the DAG is straightforward: you only need to copy it into the S3 bucket associated with your MWAA environment:

from datetime import datetime
from airflow.models import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
with DAG(
    dag_id="customer_dag",
    schedule_interval="@daily",
    start_date=datetime(year=2022, month=1, day=1),
    catchup=False,
) as dag:
    # 1. Truncate table in Postgres
    task_truncate_tables = PostgresOperator(
        task_id="truncate_tables",
        postgres_conn_id="customer_rds",
        sql="""
            TRUNCATE TABLE customers;
        """,
    )
    # 2. Load customers
    task_load_customers = PostgresOperator(
        task_id="load_customers",
        postgres_conn_id="customer_rds",
        sql="""
            SELECT dblink_connect('conn_db_link', 'server_source_remote');
            INSERT INTO customers
            SELECT * FROM dblink (
                'conn_db_link',
                '
                SELECT
                    customers.id,
                    customers.email,
                    customers.deleted,
                    customers.updated_at,
                    customers.soft_deleted_at
                    FROM customers
                    WHERE customers.deleted = FALSE;
                    '
            ) AS t (
                id integer,
                email character varying(255),
                deleted boolean,
                updated_at timestamp without time zone,
                soft_deleted_at timestamp with time zone
            )
            ;
            SELECT dblink_disconnect('conn_db_link');
        """,
    )
    # Truncate first, then reload the table
    task_truncate_tables >> task_load_customers
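
One caveat with splitting the truncate and the load into two tasks: between them, the target customers table is empty, which downstream readers may notice. If that matters, both statements can run inside a single transaction in one task instead; a minimal sketch of that variant (same column list as above, connecting through the foreign server name directly):

BEGIN;
TRUNCATE TABLE customers;
INSERT INTO customers
SELECT * FROM dblink(
    'server_source_remote',
    'SELECT id, email, deleted, updated_at, soft_deleted_at
     FROM customers WHERE deleted = FALSE;'
) AS t (
    id integer,
    email character varying(255),
    deleted boolean,
    updated_at timestamp without time zone,
    soft_deleted_at timestamp with time zone
);
COMMIT;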

This approach turned out to be very simple. The advantage here is that I was able to filter the data I wanted to copy as well as do JOIN operations. Unfortunately, any changes to the source table schema had to be manually applied to the target table schema.
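
As an illustration of the JOIN point, the remote query can join other source tables before the data ever leaves the source database. The orders table below is hypothetical, purely to show the shape such a load could take:

INSERT INTO customers
SELECT * FROM dblink(
    'server_source_remote',
    '
    SELECT DISTINCT c.id, c.email, c.deleted, c.updated_at, c.soft_deleted_at
    FROM customers c
    JOIN orders o ON o.customer_id = c.id  -- hypothetical table: keep only customers with orders
    WHERE c.deleted = FALSE;
    '
) AS t (
    id integer,
    email character varying(255),
    deleted boolean,
    updated_at timestamp without time zone,
    soft_deleted_at timestamp with time zone
);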

AWS Database Migration Service

The AWS DMS approach was also very interesting. Initially, I was a bit hesitant to use it since I was not familiar with CDC (change data capture) technology; however, it quickly became clear that AWS DMS is also very simple to use.

There are several steps to follow and one can be up to speed with it in no time.

Steps:

  1. Create the replication instance.
  2. Create source and target endpoints (have to use superuser accounts, otherwise the CDC will not work).
  3. Create a database migration task.


Using AWS DMS, one can also migrate only selected tables defined in the database migration task. It is also possible to filter the data to copy, but it is not possible to do JOIN operations.
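
The table selection and filtering are defined as table-mapping rules on the migration task. A sketch of what that JSON could look like for the customers table from earlier (the schema and column names are assumptions based on the examples above; see the AWS DMS documentation for the full rule syntax):

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-customers",
      "object-locator": {
        "schema-name": "public",
        "table-name": "customers"
      },
      "rule-action": "include",
      "filters": [
        {
          "filter-type": "source",
          "column-name": "deleted",
          "filter-conditions": [
            { "filter-operator": "eq", "value": "false" }
          ]
        }
      ]
    }
  ]
}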

There are also three different migration types to choose from. You can:

  • migrate existing data;
  • migrate existing data and replicate ongoing changes; or
  • replicate data changes only.

I used the second migration type, following the very helpful user guide, in order to migrate my data and also replicate all ongoing changes in the tables I wanted to keep in sync.

Based on the user guide, I followed these steps:

  1. Set the correct parameter: shared_preload_libraries = 'pglogical'. You will probably need to create a new parameter group and make the change there, since you are not allowed to modify the default parameter group.
  2. Restart the Postgres source database.
  3. Run CREATE EXTENSION pglogical; on the source database, then run SELECT * FROM pg_catalog.pg_extension; to verify that pglogical was installed successfully (both statements are spelled out after this list).
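
Spelled out as SQL, the verification from step 3 looks like this, together with a quick check that the parameter change from step 1 actually took effect (the SHOW statement is my addition, not part of the guide):

-- Confirm the parameter change took effect after the restart
SHOW shared_preload_libraries;          -- should include pglogical
-- Install and verify the extension on the source database
CREATE EXTENSION pglogical;
SELECT * FROM pg_catalog.pg_extension;  -- pglogical should appear in the list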

I also experimented with creating logical replication slots but didn’t reach any meaningful conclusion.

One other advantage of using AWS DMS is that there is no need to create the schema for the database — DMS will take care of it.

Conclusion

Although I only wanted to partially copy some of the data stored in the tables, I ended up using the AWS DMS approach. Copying all data from the source database to the target database with AWS DMS took less time than partially copying it using MWAA with dblink. Eventually I will need to copy all the data anyway, and using AWS DMS put me one step ahead.

