The Exponential Growth of Data
Renowned data scientist Clive Humby coined the catchphrase “data is the new oil” back in 2006. Since then, data has become one of the most valuable assets for most organisations worldwide.
The rate of data growth over the last few years has been exponential, and 90% of the world’s data was generated in the last two years alone. By 2025, the amount of data generated each day is expected to reach a staggering 463 exabytes globally.
More and more data is coming from mobile devices, applications and cloud platforms, and in a variety of formats. Over 2.5 quintillion bytes of data are generated every day. Data is created at every digital intersection, with every click and every stream, which is driving global demand for big data analytics.
Traditional data warehouses have been staples of enterprise analytics and reporting for decades, but they were not designed to handle the explosive data growth of the last few years. Luckily, with the evolution of cloud computing, we now have cloud data warehouses that can scale in minutes to meet modern data ingestion volumes.
What is a Cloud Data Warehouse?
A data warehouse is a central repository of structured data aggregated from diverse sources for business intelligence and reporting. Those reports inform decision making, which in turn is meant to improve bottom-line profit.
A cloud data warehouse expands on this concept. It is a fully managed service that runs in a public cloud such as Microsoft Azure or Amazon Web Services. Because it is a managed service, it can auto-scale up or out to meet changing workload demands. Modern cloud architectures combine three essentials: the power of data warehousing, the flexibility of Big Data platforms, and the elasticity of the cloud, at a fraction of the cost.
How Does A Cloud Data Warehouse Work?
Data is loaded into a data warehouse using an ETL (extract, transform, load) pipeline that runs periodically. As the data is moved, it can be formatted, cleaned, validated, summarized, and reorganized.
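As a rough illustration of the ETL steps described above, here is a minimal Python sketch. It uses the standard-library sqlite3 module as a stand-in for a real cloud data warehouse, and the sample records, the `sales_by_region` table and the function names are all hypothetical, chosen only for this example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw records, standing in for rows pulled from a source system.
RAW_ORDERS = [
    {"order_id": "1", "region": " emea ", "amount": "120.50"},
    {"order_id": "2", "region": "EMEA", "amount": "79.50"},
    {"order_id": "3", "region": "apac", "amount": "not-a-number"},  # fails validation
    {"order_id": "4", "region": "APAC", "amount": "200"},
]


def extract():
    """Extract: return raw rows from the source (here, an in-memory list)."""
    return list(RAW_ORDERS)


def transform(rows):
    """Transform: clean and validate each row, then summarize by region."""
    totals = defaultdict(float)
    for row in rows:
        region = row["region"].strip().upper()  # clean and format
        try:
            amount = float(row["amount"])  # validate
        except ValueError:
            continue  # skip rows that fail validation
        totals[region] += amount  # summarize
    return sorted(totals.items())


def load(conn, summary):
    """Load: write the summarized rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region "
        "(region TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO sales_by_region (region, total) VALUES (?, ?)",
        summary,
    )
    conn.commit()


def run_pipeline(conn):
    """Run one extract, transform, load cycle against the given connection."""
    load(conn, transform(extract()))


if __name__ == "__main__":
    # sqlite3 in memory is only a stand-in for a warehouse connection.
    conn = sqlite3.connect(":memory:")
    run_pipeline(conn)
    query = "SELECT region, total FROM sales_by_region ORDER BY region"
    for region, total in conn.execute(query):
        print(region, total)
```

In a real deployment the same three stages would typically be run on a schedule, with the load step pointing at the warehouse of your choice rather than a local database.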
Who Are The Major Players in The Cloud Data Warehouse Space:
- Amazon Redshift
Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with as little as a few gigabytes of data and scale to petabytes. This lets you gain new insights from your business and customer data.
- Google BigQuery
BigQuery is a fully managed, serverless data warehouse that automatically scales to match storage and computing power needs. With BigQuery, you get a columnar, ANSI SQL database that can analyze terabytes to petabytes of data at high speed. BigQuery also lets you do geospatial data analysis using familiar SQL with BigQuery GIS. In addition, you can quickly build and operationalize ML models on large-scale structured or semi-structured data using simple SQL with BigQuery ML. And you can support real-time interactive dashboarding with BigQuery BI Engine.
- Snowflake Cloud Data Platform
Snowflake is a fully managed MPP (massively parallel processing) cloud data warehouse that runs on AWS, GCP, and Azure. As a Snowflake user, you can spin up as many virtual warehouses as you need to parallelize and isolate the performance of individual queries. Snowflake enables very high concurrency by separating storage and compute, so that many warehouses can simultaneously access the same data source.
Top 4 factors to consider when choosing a Cloud Data Warehouse:
- Scalability & Elasticity
Scalability and elasticity are the biggest factors driving the adoption of cloud-based data warehouses. However, it is key to remember that not all cloud data warehouses scale the same way. The ease with which a cloud data warehouse can scale is called elasticity. When choosing a provider, consider how easy it is to scale, the cost of scaling, and what IT resources you need to grow along the way.
- Speed
There are two main ways to think about speed: access speed and processing speed. Which warehouse will help you get the fastest query times? How quickly can you get your data into and out of the solution? And related to that, how will you maintain your warehouse so that it keeps performing at that speed? You will need to assess each architecture and find a balance between cost and speed.
- Cost
Each vendor has a different billing model: you may be charged a flat rate, per query, per hour for storage and compute, or on a pay-per-use basis. Consider the cost today and in the future. For Amazon Redshift, you only pay for what you use: you are charged per hour per node, which covers both compute and storage. Snowflake has two pricing models. Snowflake On Demand is pay-per-use for storage and compute with no long-term commitments. Snowflake Capacity offers price discounts on pre-purchased storage per month, plus compute capacity that depends on your needs. Azure charges monthly credits for storage plus a fee for any hour in which compute resources were used, regardless of the amount of time. And for S3, you’re charged for storage and number of requests. Choose a pricing structure that works with your needs so it’s easy to predict the costs down the road as you scale.
- Security
Another key point when choosing a data warehouse is its built-in data protection and security features. There are also regulatory and company standards that some industries have to keep in mind. Customer data is valuable, and it needs to be adequately secured and accessible only to authorized persons or applications.
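To make the billing models discussed under Cost more concrete, the following Python sketch compares a per-node-hour plan with a pay-per-use plan. Every rate in it is a made-up placeholder rather than a real vendor price, and the class names and the 730-hour month are assumptions for illustration only.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumed average number of hours in a month


@dataclass
class PerNodeHourPlan:
    """Billed per node per hour, with compute and storage bundled."""

    rate_per_node_hour: float  # hypothetical rate, not a vendor price

    def monthly_cost(self, nodes: int) -> float:
        # Nodes are assumed to run for the whole month.
        return nodes * self.rate_per_node_hour * HOURS_PER_MONTH


@dataclass
class PayPerUsePlan:
    """Storage and compute billed separately, only for what is used."""

    rate_per_tb_month: float  # hypothetical storage rate
    rate_per_compute_hour: float  # hypothetical compute rate

    def monthly_cost(self, stored_tb: float, compute_hours: float) -> float:
        storage = stored_tb * self.rate_per_tb_month
        compute = compute_hours * self.rate_per_compute_hour
        return storage + compute


if __name__ == "__main__":
    always_on = PerNodeHourPlan(rate_per_node_hour=0.25)
    on_demand = PayPerUsePlan(rate_per_tb_month=20.0, rate_per_compute_hour=2.0)

    # Same workload, priced two ways: 2 nodes running all month versus
    # 5 TB stored and 100 hours of compute actually used.
    print("per-node-hour:", always_on.monthly_cost(nodes=2))
    print("pay-per-use:  ", on_demand.monthly_cost(stored_tb=5, compute_hours=100))
```

Plugging in your own expected storage volume and query hours, along with current published rates, gives a quick first estimate of which model suits a given workload.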
Conclusion:
Choosing a cloud data warehouse is often a trade-off between cost, performance and scalability, and such a decision should not be made overnight. Ask questions such as how often you plan to query the data and how much data will be stored, both now and in the future.