Whether you are responsible for a single application, a large environment, or somewhere in between, it’s likely you’ve wanted to know how to make the most of your cloud service provider’s capabilities. That desire usually surfaces as a familiar set of questions.
I find the usual responses to these questions either too abstract (e.g., form a team, establish Key Performance Indicators (KPIs)) or too specific (e.g., set up DynamoDB, deploy a multi-container group using docker compose). This cookbook provides the missing middle ground between those two extremes, and it comes in three sections.
In some organizations, and for some applications, you may find that the situation warrants one or another of these migration strategies:
Move your application into cloud virtual machines (VMs). Commonly known as "lift and shift," this migration strategy can be quite satisfying for a churlish application, for a quick win under time pressure, or as an initial step in a broader campaign of cloud adoption.
Ingredients
(The effort ratings represent relative scales of effort for the migration work. Ongoing costs are more complex, but, generally speaking, greater upfront effort results in lower ongoing costs.)
Procedure
Move your application to the cloud and embrace the catalog of services your provider makes available. By pulling supporting services out of your VM configuration, this strategy can simplify your configs, reduce maintenance and make good use of the constant improvements cloud providers are incentivized to roll out.
Ingredients
Procedure
To illustrate the flexibility of the Replatform strategy, here's a specific spin that targets provisioning compute resources into a container service. This is where we see the Dockerfile take the place of the configuration management solution. Here, we're left to manage only the application endpoints and access to the cloud provider APIs.
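To make that workflow concrete, a hedged sketch of the build-and-deploy loop might look like the following (the registry, image, cluster and service names are placeholders, and the final command assumes an AWS ECS-style target; other container services have their own equivalents):

```bash
set -euo pipefail

REGISTRY="registry.example.com/myteam"
IMAGE="legacy-app"
TAG="$(git rev-parse --short HEAD)"

# The Dockerfile now plays the role the configuration management tool used to.
docker build -t "${REGISTRY}/${IMAGE}:${TAG}" .
docker push "${REGISTRY}/${IMAGE}:${TAG}"

# Roll the service onto the new image (exact mechanics depend on how the
# service's task/revision references the image tag).
aws ecs update-service \
  --cluster legacy-app-cluster \
  --service legacy-app \
  --force-new-deployment
```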
Ingredients
Procedure
A Dockerfile that encapsulates and runs the application.

The re-architect (a.k.a. refactor) strategy is usually undertaken to meet specific business needs (e.g., enabling massive scalability, facilitating ongoing development efforts, removing vendor dependencies) and will change where and how your application runs. One popular embodiment of this strategy is the Strangler Pattern, which uses a proxy to direct each request to the preferred implementation of its functionality, whether that is the original application or a superseding service/microservice.
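To illustrate the routing half of the Strangler Pattern, here is a minimal sketch using nginx as the proxy (hostnames and paths are placeholders, and haproxy or a cloud load balancer would work just as well):

```bash
# Route one extracted endpoint to the new service; everything else still goes
# to the original application.
cat > /etc/nginx/conf.d/strangler.conf <<'EOF'
server {
    listen 80;
    server_name app.example.com;

    # Requests for the extracted functionality go to the new microservice.
    location /api/orders/ {
        proxy_pass http://orders.internal.example.com;
    }

    # Everything else is still served by the original application.
    location / {
        proxy_pass http://legacy-app.internal.example.com;
    }
}
EOF
nginx -t && nginx -s reload
```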
Ingredients
Procedure
Instead of fixing up your old application, meet the need with a new solution. Implementing a suitable cloud offering, open-source project, SaaS product or commercial off-the-shelf option can deliver aspects of the cloud dream: simplified setup, reduced maintenance and continual updates.
Ingredients
Procedure
Stop running the things you don't actually use. The system that is most secure, cheapest to run and has the fewest bugs is the one that does not exist.
Ingredients
Procedure
On second thought, leave this application where it is. The top reasons to employ this strategy include upcoming application retirement, cost amortization and/or deferral due to other priorities.
Ingredients
(The effort ratings represent relative scales of effort for the migration work. They don't account for risk, missed opportunity or ongoing cost.)
Procedure
Every organization has different tastes and each application can drive specific requirements. For each of the following items, consider what capabilities are needed for the application to adequately serve its function and how much time/effort/funding you're willing to invest. These will be useful inputs for selecting a migration strategy and determining which cloud services to target.
Have duplicate copies of your data spread out through time and geography.
Important Questions:
Common Solutions:
Have a plan — or resources standing by — to recover from adverse incidents.
Important Questions:
Common Solutions:
Have a process for making updates and changes.
Important Questions:
Common Solutions:
Have defined expectations so you can evaluate the migration.
Important Questions:
Common Solutions:
Have the documentation ready and up-to-date.
Important Questions:
Common Solutions:
Have a plan for the more commonly neglected dependencies.
Important Questions:
Common Solutions:
Have a process for executing the migration itself (if you need one).
Important Questions:
Common Solutions:
Before digging into the migration-specific details, let's spend a moment discussing a few methods that will keep your work clean and safe.
The most widely accepted and effective practice for building and maintaining clean cloud environments is Infrastructure as Code (IaC). To manage an environment or resources with IaC means that instead of allowing arbitrary, untracked changes to be made in your cloud provider, resources — and their configurations — are controlled via a process that is rooted in a version control system (e.g., git).
A typical IaC scenario consists of declarative configuration written for a suitable IaC tool (e.g., Terraform, CloudFormation, Azure Resource Manager), which is updated and applied whenever changes are needed; the same process can also run periodically as a change-detection mechanism. This practice eliminates error-prone manual provisioning and enables reproducibility, repeatable testing, rollback capabilities, reusability, history tracking and grep-ability.
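To make that concrete, here is a minimal sketch of the day-to-day loop, assuming Terraform; the branch and file names are purely illustrative:

```bash
# Minimal, illustrative IaC change loop (Terraform assumed; names are placeholders).
git checkout -b widen-app-subnet
# ...edit network.tf in your editor of choice; the change lives in code, not the console...
terraform fmt && terraform validate        # catch syntax and schema errors early
terraform plan -out=widen-app-subnet.plan  # review exactly what will change
terraform apply widen-app-subnet.plan      # apply the reviewed plan after approval/merge
```

Running `terraform plan` on a schedule against the same configuration is one way to get the change-detection behavior mentioned above.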
Running a non-IaC environment is like having an entire data center without any cable labels. With IaC, you are able to spawn new data centers at will.
Immutable Infrastructure is another key approach for setting boundaries on when and where the complexity of our systems is allowed to grow. The concept is that once compute resources are deployed, they are only ever replaced, removed or stopped/restarted. Any action that would otherwise involve changing the application or its configuration is accomplished by replacing the nodes that are currently running the application.
Container-based platforms and Functions as a Service (FaaS) cloud offerings have this concept baked in, but when you are using a bare VM cloud service, the path to implementing it is to configure an automated, version-controlled process that builds VM images and triggers the upgrade/deployment, rollback and maintenance routines the application requires. Such an image build process can make use of cloud provider tools, HashiCorp Packer or configuration management tools (e.g., Ansible, SaltStack, MS DSC).
If this capability feels out of reach, the first step is to automate your server setup with a CM tool. The second is to use the CM tool for maintenance tasks like patching and deploys. Eventually, the path to enhanced sanity and control in your compute environment will open to you.
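As one hedged example of what the bake-and-replace rhythm can look like (template, image and group names are placeholders, and the replacement step varies by provider):

```bash
# Build a fresh, fully configured VM image instead of patching running servers.
packer init app-image.pkr.hcl
packer build app-image.pkr.hcl             # produces a new image (e.g., an AMI)

# Roll the fleet onto the new image; on AWS this might be an instance refresh.
aws autoscaling start-instance-refresh --auto-scaling-group-name app-asg
```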
What is the most fundamental principle of security? In my opinion, it’s Least Privilege. In all things, grant only the permissions that are required for your systems, people and processes to function. This applies to the capabilities of your application, access into and out of your networks, and access to your cloud provider. The potential liability of leaked cloud provider credentials is equal to the cost of all your data being deleted, plus the hourly rate of the most expensive resources times the number of hours the leak is unnoticed. Of course, there is a balancing point between granularity and manageability, but Infrastructure as Code shifts that balance in our favor.
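For instance, a narrowly scoped policy for a backup job might look like the following sketch (AWS IAM syntax, with a placeholder bucket and prefix), rather than a blanket grant of storage access:

```bash
# Illustrative only: allow writing backups to one prefix of one bucket, nothing more.
cat > backup-writer-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::example-app-backups/db-dumps/*"
    }
  ]
}
EOF
aws iam create-policy \
  --policy-name backup-writer \
  --policy-document file://backup-writer-policy.json
```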
Also, it is best practice to set up billing alerts for your environments. All cloud providers have them, and, even when they aren't free, they're worth it. If you don't have an estimate, make up a threshold that seems reasonable, and check the threshold against your actual costs until you know what to expect. I prefer a threshold that has room for monthly variability but is low enough to catch when something unexpected is happening.
Want more advice and orienting principles? Read more here and here.
Cloud migration consists of two categories of changes: provisioning (instantiating and initializing resources in a cloud provider) and cutover (transitioning application activity from one environment to another). Good change practices therefore apply directly to cloud migration activities.
There are two components needed to responsibly make changes to an application or its environment. The first is actually making the change; the second is validating the change.
Validating a change requires satisfying only two conditions: prove that the change is in effect and prove that the application is behaving as desired. The degree of rigor required for these proofs will vary according to your circumstances (e.g., frequency of the change type, SDLC phase, application criticality/visibility, blast radius).
Let's dig into some of the options.

Start with quick checks along the lines of: terraform plan should not indicate that changes are needed, a.b.c resolves to the expected address, and so on. A good set of these "sanity checks" can quickly give you evidence of a misconfiguration and kick off the process of narrowing down the problem space if there is an issue. These tests are usually straightforward to automate using the check/test modes of CM tools, purpose-built verification tools (e.g., Serverspec), cloud provider platform features, etc. They are useful for provisioning and, after any adjustments for architecture changes, pre/post-cutover scenarios.
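A handful of these checks, scripted so they can run after every change, might look like this (endpoints, hostnames and unit names are placeholders for your application's own):

```bash
set -e

# 1. The IaC state should match reality (exit code 2 means pending changes/drift,
#    which fails the script under `set -e`).
terraform plan -detailed-exitcode

# 2. The service answers on its health endpoint.
curl --fail --silent https://app.example.com/healthz

# 3. DNS resolves to the expected target.
dig +short app.example.com

# 4. The process is actually running on the host (VM-based deployments).
systemctl is-active myapp.service
```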
These kinds of in-the-weeds, application-level tests can provide a major confidence boost that things are running as expected and can potentially serve as a basis for comparison between environments. Typically, these tests are written with a protocol-specific framework (e.g., for HTTP applications, tools like Cypress, Puppeteer, or Selenium). Since they're specific to application behavior, these tests are likely to run unchanged for provisioning and pre/post-cutover.
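Even without a full test framework, a rough comparison between environments can be as simple as requesting the same paths from both and diffing the responses (the URLs and paths below are placeholders):

```bash
OLD="https://old.example.com"
NEW="https://new.example.com"

# Compare the two environments' responses for a few representative paths.
for path in /login /api/v1/status /reports/monthly; do
  diff <(curl -s "${OLD}${path}") <(curl -s "${NEW}${path}") \
    && echo "OK   ${path}" \
    || echo "DIFF ${path}"
done
```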
When executed under appropriate conditions with sufficient care, this kind of monitoring regime can weed out otherwise overlooked issues and aid in continuously ratcheting up application quality. Infrastructure components required by this kind of testing are available in many cloud providers. Support for custom metrics would need to be integrated into the application, and procedures for generating summary metrics and comparing results would need to be evaluated and adopted. Early versions of provisioning configuration may not be ready to integrate with this, but it can shine with pre/post-cutover.
Specifically: backup/restore and disaster recovery processes.
These capabilities, though not impacted by most changes and infrequently thought of, are an application's last line of defense against a high-severity week of extreme sadness. Automation for these processes is typically written with an IaC or CM tool. Be sure to test these capabilities whenever they may be impacted by a change, and on a regular cadence.
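A scheduled restore test is one way to keep that last line of defense honest. A minimal sketch, assuming PostgreSQL dumps stored in object storage (bucket, database and table names are placeholders):

```bash
set -euo pipefail

# Pull the most recent backup and restore it into a scratch database.
aws s3 cp s3://example-app-backups/db-dumps/latest.dump /tmp/latest.dump
createdb restore_test
pg_restore --dbname=restore_test /tmp/latest.dump

# Prove the restored data is usable, not merely present.
psql --dbname=restore_test -c "SELECT count(*) FROM orders;"
dropdb restore_test
```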
While an application is in the process of being migrated to the cloud, it gets more focused attention than it normally would. Any automated tests that can be used to validate the new environment, build confidence for the cutover or verify the application's health are likely to come in handy post-migration, when the focus turns to feature deployment, maintenance and routine operations. So, at the very least, pick the low-hanging fruit.
Depending on where the data for your application lives, how much there is and the nature of the application's interactions with it, getting data synced in a timely and consistent way to your target cloud service could take you down one of a few different paths.
A direct approach can be effective in simple cases.
Upload a directory of assets/data to a cloud service with standard cloud provider tools, or to a cloud VM with rsync or zfs send. Alternatively, use a cloud VM as a temporary endpoint for a cloud service by mounting cloud storage to the filesystem, tunneling with ssh, or proxying via nginx/haproxy to resources in the cloud network. This is an excellent method for loading a fresh database dump, second only to scp-ing the dump to a cloud VM and then running the DB load, which needs to push fewer bits over the internet if the DB load has to be restarted for some reason.
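In command form, the simple cases tend to boil down to a few one-liners like these (hosts, datasets and database names are placeholders):

```bash
# Push a directory of assets to a cloud VM.
rsync -avz --progress ./assets/ cloud-vm.example.com:/srv/app/assets/

# Or replicate a ZFS dataset over SSH.
zfs snapshot tank/appdata@migrate1
zfs send tank/appdata@migrate1 | ssh cloud-vm.example.com zfs receive tank/appdata

# Copy a database dump up first, then load it from the VM, so a failed load
# doesn't mean re-sending the dump over the internet.
scp app_db.dump cloud-vm.example.com:/tmp/
ssh cloud-vm.example.com 'pg_restore --dbname=app_db /tmp/app_db.dump'
```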
If you are deferring a transition to immutable infrastructure until a later opportunity, most on-premises virtualization/cloud provider pairings offer a path for VM export/import. Plus, some cloud providers offer a background-transfer, live replication service. Be aware of your CPU architecture, device names, drive mappings, customized hosts files, etc. when configuring a whole-VM move like this.
Things get more interesting when the data that needs to be synced would take a week or more to transfer over existing network connections. For these cases, I would look at the "Sneaker Net" services that the target cloud provider offers.
A typical workflow with the "Sneaker Net" services is: request a transfer device from the provider, copy your data onto it locally, ship it back, and let the provider load the contents into your target storage service. This is especially useful if the change rate or other factors don't invalidate the data while it's in transit.
Another solution is to use a cloud-integrated storage appliance to manage transferring the data (these can throttle, encrypt and optimize the transfer). They come in a "computer in the mail" flavor or as a VM for which you provide the underlying storage. Essentially, these devices provide network storage services (e.g., NFS, SMB) and copy the files they're given out to a cloud service. So, you set up the device, do a local storage migration to it, and let it trickle your files to the cloud service according to the transfer parameters you provided.
When using a cloud-integrated storage appliance, make sure you:
Either because of the time elapsed in completing an initial migration or because of an unrelenting rate of changes, some scenarios require that an initial data transfer be continually updated with new changes from the live environment. For innocent file data, iteratively running rsync or zfs send can be sufficient. Otherwise, you may be forced to use a specialized file replication product.
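For the file-data case, the iterative passes might look like this (continuing the placeholder names from above):

```bash
# Re-run rsync to pick up changes (and deletions) since the last pass.
rsync -avz --delete ./assets/ cloud-vm.example.com:/srv/app/assets/

# Or send only the delta between two ZFS snapshots.
zfs snapshot tank/appdata@migrate2
zfs send -i tank/appdata@migrate1 tank/appdata@migrate2 \
  | ssh cloud-vm.example.com zfs receive -F tank/appdata
```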
The path for keeping database data in sync feels somewhat clearer: use the database engine's native replication, or use a data migration/replication service from your cloud provider.
To work well, either of those options will require adequate network access, stability and bandwidth.
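As a sketch of the first option, assuming PostgreSQL and logical replication (hosts, database and user are placeholders, the source needs wal_level=logical, and the target schema must already exist):

```bash
# On the source (on-premises) database: publish the tables to replicate.
psql "host=onprem-db.example.com dbname=app_db" \
  -c "CREATE PUBLICATION app_migration FOR ALL TABLES;"

# On the target (cloud) database: subscribe and let it catch up continuously.
psql "host=cloud-db.example.com dbname=app_db" \
  -c "CREATE SUBSCRIPTION app_migration
        CONNECTION 'host=onprem-db.example.com dbname=app_db user=replicator'
        PUBLICATION app_migration;"
```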
In some circumstances, such as when an application's complexity is better suited to being decomposed into multiple, phased migrations (following the Strangler Pattern, for instance), it can be worthwhile to add features to the application that directly support the changes being made.
Such an approach increases the surface area where problems might occur, but adding features to the application may beat the alternatives.
Finally, some painful cases need local adjustments before they are ready to support migration activities.
Years ago, I had the pleasure of dealing with an application that would somehow re-write the alternate data stream of a file after it had closed the file handle. The details were never clear to me, but the share full of corrupt documents spoke for itself.
Take the time to understand your application's relationship to its data and your organization's priorities when planning this phase of your migration.
It is tempting to think cutover is solely about making the changes that update which environment is live, but I recommend thinking about cutover in three distinct phases: the Plan, the Change and the Rest.
The first step in formulating the Plan is to ensure you've met all the application's "Nutritional Requirements" and any bonus organizational requirements, especially those that are difficult to bolster later on.
The second step in formulating the Plan is to check your priors. Verify that you know how to gauge the traffic hitting the live environment and how that flow is controlled. Verify that your Testing and Validation steps cover the most important and informative areas of the environment.
The third and final step of formulating the Plan is to literally write a plan. You can decide the level of formality, whether someone else should review it and where to add it to the project documentation.
The plan should consist of at least three parts: the steps for making the change, the steps for validating it, and the steps for rolling it back if things go sideways.
At a minimum, the plan should be written with enough detail that your nearest coworker can step in for you. Writing change plans in this pattern is useful well beyond cutover day.
The Change consists of executing the plan and noting any deviations found to be necessary. For cloud migrations, I would expect most cutover changes to consist of a DNS update. When planning a change that updates DNS, remember that the TTL value determines the duration of cache validity for the record. You probably want to drop it to a low number ahead of time so that the change is propagated to the world without delay (then raise it again in the Rest).
For validating a DNS update, use nslookup locally, then mxtoolbox and, for good measure, whatever site shows up first when you search "global dns check." After that, complete the required application-specific checks and look to see how much traffic is still hitting the old environment. A different cutover change I might expect is updating a proxy server's configuration. Validation there is also straightforward: check the proxy's backend metrics to verify requests are being sent to the cloud environment, then follow with the application and old-environment investigation. By now, you've got the idea.
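A condensed version of that DNS cutover-and-verify flow might look like this (the zone ID, domain and change batch file are placeholders; any DNS provider has an equivalent to the Route 53 command shown):

```bash
# Days ahead of cutover: lower the TTL, then repoint the record on the day itself.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch file://repoint-app-record.json

# After the change: confirm resolvers are returning the new target...
dig +short app.example.com
nslookup app.example.com 8.8.8.8

# ...and confirm real requests are landing in the new environment.
curl -s -o /dev/null -w '%{http_code} %{remote_ip}\n' https://app.example.com/healthz
```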
The double meaning of the Rest is intentional. First, congratulations! The application is migrated; go ahead and take the weekend off. Second, determine what the next priority is.
This is my take on the options, significant considerations and best recommendations for approaching a cloud migration, presented in the most palatable format I could think of.
Hopefully, what you have read here has whetted your appetite for a migration project or two. Now, you can:
Or, something better!
Share your migration projects with us, and schedule a free consult today!