Data warehousing is the process of consolidating, storing, and processing large amounts of data from many disparate sources. It's an endeavor that most large companies have to undertake eventually. There are many data warehousing solutions available on the market, and which one your company ultimately goes with depends on where you are as an organization.
If you find yourself in the Microsoft ecosystem, Azure Synapse is likely the first solution that comes to mind, as it's Microsoft's default answer to the data warehousing problem. However, you may have additional considerations related to organizational structure, staffing, data location, or budget. It might be wise to supplement Synapse with a different product such as Databricks or Snowflake.
Moving large datasets around can be slow and expensive, especially if you have to do it across the open internet. In most cases, you want your data warehouse as close to your source data as practical. If you are already using Azure, that's great: your data is already in the cloud, close to where Synapse runs. If you have a lot of on-premises data and you want to go with a cloud solution, that could become an issue.
Azure Synapse is Microsoft's rebranding and evolution of Azure SQL Data Warehouse, combining enterprise data warehousing with big data analytics. This means Synapse is fully integrated with the rest of the Microsoft ecosystem and adds support for Apache Spark, Power BI and Azure Machine Learning.
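To make the Spark integration concrete, here is a minimal sketch of what a Synapse Spark notebook cell might look like when reading data from Azure Data Lake Storage Gen2. The storage account, container, path, and column names are placeholders; the `spark` session object is the one the Synapse notebook runtime provides.

```python
# Minimal sketch: reading Parquet data from ADLS Gen2 inside a Synapse
# Spark notebook. "mydatalake", "raw", and the path are placeholders.
df = spark.read.parquet(
    "abfss://raw@mydatalake.dfs.core.windows.net/sales/2023/"
)

# Ordinary Spark SQL works against the loaded data.
df.createOrReplaceTempView("sales")
monthly = spark.sql(
    "SELECT MONTH(order_date) AS month, SUM(amount) AS total "
    "FROM sales GROUP BY MONTH(order_date)"
)
monthly.show()
```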
Your Synapse capabilities can be further enhanced by bringing in Databricks. Databricks is a managed, optimized platform built on Apache Spark, making custom workloads easier to craft, provision and deploy in an ad-hoc manner. Its query optimizer, advanced caching, and indexing claim 10-40x the performance of vanilla Spark. All of this is wrapped up in a virtual filesystem and a notebook environment à la Jupyter, which will likely appeal to any data scientists you have on staff.
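As a rough illustration, a Databricks notebook cell is still ordinary PySpark; the platform's caching and indexing work underneath it. The catalog, table, and column names below are hypothetical, and `display()` is the notebook helper Databricks provides for rich output.

```python
# Minimal sketch of a Databricks notebook cell. "analytics.web_events"
# and its columns are placeholders; `spark` is provided by the cluster.
events = spark.read.table("analytics.web_events")

# The platform handles caching and data skipping behind the scenes;
# from the notebook's point of view this is plain Spark code.
recent = (
    events
    .where("event_date >= '2024-01-01'")
    .groupBy("country")
    .count()
)

display(recent)  # Databricks notebook helper for tabular/chart output
```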
If you would rather not manage the infrastructure so intimately, especially if you need a solution that spans multiple cloud providers, Snowflake offers a more turnkey option.
Snowflake's architecture separates storage from compute resources, which can make costs easier to predict and control. It can also be deployed to other cloud providers like AWS or GCP, which may be useful if your data is spread across clouds, because you'll have a familiar architecture, interface and billing structure across them all. It's even possible to run compute in one cloud against storage in another if you need everything fully centralized.
Its concept of virtual warehouses, for instance, lets you provision and scale independent pools of compute against the same shared storage.
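To illustrate the virtual warehouse idea, here is a minimal sketch using Snowflake's Python connector (the snowflake-connector-python package). The account identifier, credentials, and warehouse names are placeholders.

```python
# Minimal sketch: managing virtual warehouses via snowflake-connector-python.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # placeholder account identifier
    user="etl_user",        # placeholder user
    password="***",         # use a secrets manager in practice
)
cur = conn.cursor()

# Each virtual warehouse is an independent pool of compute over the same
# shared storage, so ETL and BI workloads never compete for resources.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS etl_wh
    WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
""")
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS bi_wh
    WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE
""")

# Resizing is a one-line statement; billing follows each warehouse's size
# and uptime, which is what makes costs easier to predict.
cur.execute("ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE'")
```

Because each warehouse suspends automatically when idle and resumes on demand, you only pay for the compute a given team or workload actually uses.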
It turns out that Snowflake and Databricks can work together, too. Using the Snowflake Connector for Spark, which is bundled with Databricks, you can bridge them and use Snowflake as a Spark data source, leveraging Snowflake's SQL optimizations alongside Databricks notebooks for more complex operations.
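A minimal sketch of that bridge is below, reading a Snowflake table from a Databricks notebook. The connection details, database, schema, warehouse, and table names are placeholders; on Databricks the connector is exposed under the short `"snowflake"` format name.

```python
# Minimal sketch: reading a Snowflake table from a Databricks notebook
# through the Spark connector. All connection values are placeholders.
options = {
    "sfUrl": "my_account.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "***",      # use a Databricks secret scope in practice
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "BI_WH",
}

# Filters and aggregations can be pushed down to Snowflake's SQL engine,
# while the resulting DataFrame stays available for further Spark work.
df = (
    spark.read
    .format("snowflake")
    .options(**options)
    .option("dbtable", "WEB_EVENTS")
    .load()
)

df.groupBy("COUNTRY").count().show()
```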