Curbing the cost of cloud analytics and data warehousing


Sponsored Article: Data wants to be free. But working with it comes at a cost, especially when it needs heavy processing. If we want to pull deeper insights from our data, we have to drive down the expense of the compute behind it. The cloud can help, making the enormous data stores that feed those new insights more efficient while keeping costs in check.


The rising cost of buying and maintaining data storage and processing systems has hit some companies hard. Yet even as the prices of server and storage components fluctuate, businesses find themselves needing to store and process ever more data to stay competitive. Data Age 2025, a 2018 report from IDC and Seagate, estimated that the volume of data created, captured, and replicated each year would grow from 33 zettabytes then to roughly 100 zettabytes this year, and predicted those volumes would rise another 75 percent to reach 175 zettabytes by 2025.

When weighing the value you get for what you pay, the metric to focus on is price performance.

As both prices and data volumes climb, businesses have to prioritize efficiency if they are to keep up with storing and processing enterprise data. That is why AWS keeps investing in the capabilities of its data warehousing and analytics service, Amazon Redshift. We recently spoke with Stefan Gromoll, an Amazon performance engineering manager dedicated to improving Redshift's performance for its users.

"Our main priority is to constantly enhance the cost-efficiency of Redshift, ensuring that you get the maximum performance for your investment," he explains. Ensuring consistent and predictable price performance enables customers to effectively manage the expenses associated with extensive data processing. If Amazon can outperform both traditional on-site data warehousing providers and other cloud data warehouse rivals in terms of price performance, and consistently enhance the cost-efficiency of its own service, then the performance team is successfully carrying out their responsibilities.

Separating storage and compute

Price covers both the cost of storing data and the cost of processing it. "But for most customers, in my experience, the biggest component is compute cost," Gromoll explains.

Amazon Redshift has emphasized price performance since its launch, but the December 2019 introduction of RA3 nodes with managed storage transformed the service for its users: storage and compute can now be scaled independently, offering greater flexibility and efficiency. RA3 nodes combine high-performance SSDs for local caching with Redshift Managed Storage to improve both cost and performance.
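For teams provisioning a cluster themselves, choosing an RA3 node type is what unlocks that independent scaling. Here is a minimal boto3 sketch; the cluster name, credentials, and sizing are placeholders, not a recommended configuration:

```python
import boto3

redshift = boto3.client("redshift")

# RA3 nodes pair local SSD caching with Redshift Managed Storage,
# so compute (node count) scales independently of data stored.
redshift.create_cluster(
    ClusterIdentifier="analytics-cluster",   # hypothetical name
    ClusterType="multi-node",
    NodeType="ra3.4xlarge",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="Replace-Me-1",       # placeholder credential
    DBName="dev",
)
```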

Decoupling compute from storage let customers scale up their data processing without paying for storage they do not need. The Redshift performance team works specifically on making both the compute and storage layers more efficient, and it targets improvements that show up as visible cost savings rather than just better benchmark numbers.

"We are interested in understanding our position based on these authorized indicators as we are aware that individuals put them to use," he explains. Nonetheless, our primary objective is to enhance the actual performance of data warehouses in real-life scenarios that truly impact customers.

Gromoll and his team regularly analyze performance data from across the Redshift fleet to uncover common optimization opportunities, then work out how to get more data processing out of Redshift without increasing the cost.

A steady stream of performance improvements

One recent example is string vectorization, a technique that speeds up many string-processing tasks. Historically, because CPU registers were small, a single core could execute only one arithmetic operation per clock cycle. To run operations concurrently, companies spread calculations across multiple cores, but there was still performance left on the table within each core. As registers grew, a single register could hold several numbers at once, letting vectorized code apply one instruction to multiple values in a single clock cycle.
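To illustrate the principle, here is a minimal Python sketch (not Redshift code) comparing an element-at-a-time loop with a vectorized operation. NumPy's array arithmetic uses SIMD instructions where the hardware supports them, though in Python part of the measured gap is interpreter overhead rather than pure SIMD:

```python
import time
import numpy as np

N = 2_000_000
a = np.random.rand(N)
b = np.random.rand(N)

def scalar_add(a, b):
    # One addition per loop iteration: the pre-SIMD model.
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def vectorized_add(a, b):
    # NumPy applies the addition across the whole array at once,
    # letting the CPU process several values per instruction.
    return a + b

start = time.perf_counter()
scalar_add(a, b)
print(f"scalar:     {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
vectorized_add(a, b)
print(f"vectorized: {time.perf_counter() - start:.3f}s")
```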

Much of the data Redshift users store is strings rather than numbers. Gromoll recalls the team realizing that vectorization could therefore deliver worthwhile string-performance gains for customers.

Amazon's engineers devised a new way to handle compressed string data on disk. By vectorizing the algorithms that decode string compression encodings, they could scan compressed, dictionary-encoded string columns in a CPU-efficient way. According to Gromoll, the change made processing large numbers of strings in string-heavy queries up to sixty times faster.
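Redshift's implementation is proprietary, but the core idea of scanning a dictionary-encoded column can be sketched in a few lines of NumPy. The column and predicate here are invented for illustration:

```python
import numpy as np

# A toy string column with heavy repetition, as in real warehouses.
values = np.array(["US", "DE", "US", "FR", "US", "DE"] * 1_000_000)

# Dictionary encoding: store each distinct string once, and the
# column itself as an array of small integer codes.
dictionary, codes = np.unique(values, return_inverse=True)

# Evaluate the filter once per distinct value on the dictionary...
match_code = np.searchsorted(dictionary, "US")

# ...then scan the integer codes with one vectorized comparison,
# which the CPU can run over many codes per instruction, instead
# of comparing strings row by row.
mask = codes == match_code
print(f"matching rows: {mask.sum()}")
```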

In 2022, Playrix, a mobile gaming company with 85 million active users, adopted Amazon Redshift Serverless to sharpen its marketing analytics and drive game sales. Understanding player behavior meant analyzing data at petabyte scale, and the EC2-hosted PostgreSQL database Playrix had been relying on was struggling to keep up. Switching to Redshift, combined with the serverless container service AWS Fargate to gather data from Playrix's partner systems, brought notably faster queries over long histories of data and cut the company's monthly costs by 20 percent.

Parallelism matters at the cluster level too. Redshift includes a feature called Concurrency Scaling, which automatically adds and removes compute capacity to handle fluctuating demand from both read and write queries. By scaling automatically, Redshift reduces or eliminates query queuing, speeding up processing for heavy workloads, and customers are billed only for the extra compute their queries actually use.

Concurrency Scaling proved another important benefit of Playrix's move to Redshift. Gaming analytics workloads are unpredictable, but with Concurrency Scaling Playrix can absorb sudden bursts of SQL queries from its own users and scale out quickly without running up high costs. Today Playrix maintains up to 5TB of live streaming data from its marketing partners in its Amazon Redshift data warehouse, and runs machine learning over that data to forecast revenue and customer lifetime value.
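Concurrency Scaling is governed through cluster parameters. As a rough sketch (the parameter group name is hypothetical), capping how many transient clusters Redshift may add looks like this in boto3:

```python
import boto3

redshift = boto3.client("redshift")

# Cap the number of transient clusters Concurrency Scaling may add;
# customers pay only for the extra compute their queries use.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-params",  # hypothetical name
    Parameters=[
        {
            "ParameterName": "max_concurrency_scaling_clusters",
            "ParameterValue": "4",
            "ApplyType": "dynamic",
        }
    ],
)
```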

Redshift's Auto Workload Manager (AutoWLM) lets users specify which queries should be routed to a concurrency-scaling cluster. Amazon's workload management scheduler removes the need for manual tuning by working out the optimal number of queries to run concurrently and allocating appropriate resources, such as memory, to each one.

That matters because customers often submit queries in large concurrent batches. "When you have 1,000 users who all send queries simultaneously, AutoWLM determines the most efficient way to execute all 1,000 of those queries, maximizing throughput," says Gromoll. The system also learns continuously from query patterns, tailoring this optimization over time and adjusting its query routing as warehouse usage changes.
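Workload management is set through the wlm_json_configuration cluster parameter. A minimal sketch for switching on Auto WLM might look like the following; consult AWS's WLM documentation for the full queue schema, since the JSON shape here is kept deliberately simple:

```python
import json
import boto3

redshift = boto3.client("redshift")

# Minimal Auto WLM configuration; named queues with query groups,
# priorities, and concurrency_scaling settings can be layered on.
wlm_config = [{"auto_wlm": True}]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-params",  # hypothetical name
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
            "ApplyType": "dynamic",
        }
    ],
)
```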

Automatic workload management is now enabled on the Redshift platform by default. And in July 2022, Amazon introduced a serverless option for Redshift alongside its existing provisioned deployment methods.
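Getting started with the serverless option takes little more than a namespace for databases and users and a workgroup for compute. This boto3 sketch uses placeholder names and an arbitrary starting capacity:

```python
import boto3

serverless = boto3.client("redshift-serverless")

# A namespace holds databases, users, and related settings.
serverless.create_namespace(namespaceName="analytics")

# A workgroup holds compute; baseCapacity is measured in Redshift
# Processing Units (RPUs), and the workgroup scales from that floor.
serverless.create_workgroup(
    workgroupName="analytics-wg",
    namespaceName="analytics",
    baseCapacity=32,
)
```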

There are pay-as-you-go options, but customers who can plan ahead save money with reserved instances. Playrix chose reserved instances for its Redshift capacity to improve price performance, while also taking advantage of EC2 spot instances. Spot instances are ephemeral and can be reclaimed by Amazon at any moment, but they are priced at a steep discount, and smart customers can put them to good use for short-lived tasks.

Tactics like these have paid off handsomely for Playrix: the company has seen a 1,000 percent increase in the speed of its analytics queries since moving to Redshift, at roughly the same cost as its previous standalone EC2-based PostgreSQL system.

Gromoll says the team has invested heavily in Redshift's performance, and he stresses that both the provisioned and serverless options scale in small increments, giving customers fine-grained control over cost. A provisioned Redshift warehouse can be grown or shrunk by a single compute node, while Redshift Serverless offers even finer control through Redshift Processing Units (RPUs). That means customers can tune performance and cost to exactly what they need without paying for compute they do not use, Gromoll explains: if you only need a small boost in compute, you do not have to double the size of your warehouse.
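In practice those small increments look something like the sketch below, with the cluster and workgroup names carried over as placeholders:

```python
import boto3

# Provisioned: grow a cluster by a single compute node.
boto3.client("redshift").resize_cluster(
    ClusterIdentifier="analytics-cluster",  # hypothetical name
    NumberOfNodes=3,  # up from 2; one-node increments
)

# Serverless: nudge the RPU floor up one step instead.
boto3.client("redshift-serverless").update_workgroup(
    workgroupName="analytics-wg",
    baseCapacity=40,
)
```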

Seeing price performance in action

These price performance gains matter more as customers process ever larger amounts of data in Redshift, and plenty of users handle very substantial workloads. Offering security and governance features, from rich identity management to fine-grained authorization controls such as role-based access control, row-level security, and dynamic data masking, at no additional charge reinforces the cost savings and strengthens the overall price performance.
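Those controls are expressed as plain SQL. As an illustration using Amazon's redshift_connector Python driver, with invented table, column, and role names (the exact policy syntax is worth double-checking against the Redshift docs), row-level security and masking policies attach like this:

```python
import redshift_connector

# Placeholder connection details.
conn = redshift_connector.connect(
    host="analytics.example.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="admin",
    password="...",
)
cur = conn.cursor()

# Row-level security: this role only sees rows for its own region.
cur.execute("CREATE ROLE analyst_emea;")
cur.execute("""
    CREATE RLS POLICY emea_only
    WITH (region VARCHAR(16))
    USING (region = 'EMEA');
""")
cur.execute("ATTACH RLS POLICY emea_only ON sales TO ROLE analyst_emea;")
cur.execute("ALTER TABLE sales ROW LEVEL SECURITY ON;")

# Dynamic data masking: hide card numbers from that same role.
cur.execute("""
    CREATE MASKING POLICY mask_card
    WITH (card_number VARCHAR(19))
    USING ('****-****-****-****');
""")
cur.execute("""
    ATTACH MASKING POLICY mask_card
    ON sales(card_number)
    TO ROLE analyst_emea;
""")
conn.commit()
```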

Nasdaq, the financial marketplace and home to nearly 4,000 listed companies around the world, switched to Redshift in 2014 to power its business analytics. Today it ingests and processes billions of financial records every night, feeding its business analytics operation by analyzing four terabytes of data after the market closes. The main challenge is getting that data into the system quickly and efficiently for initial processing.

With market volumes hard to predict, Nasdaq worked with AWS to rearchitect the Redshift-based data warehousing process it depended on. It moved its data lake onto the Amazon S3 managed storage layer and adopted Amazon Redshift Spectrum.

Redshift Spectrum lets the exchange query its immense S3 data lake directly, eliminating the time previously spent extracting, transforming, and loading data into Redshift as a separate step. The new architecture also decoupled storage from compute, allowing the company to devote its compute nodes entirely to query processing and cutting query processing times by a third.
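The general pattern behind that setup, querying files in S3 through an external schema, looks roughly like this. The bucket, catalog database, IAM role ARN, and table layout are all placeholders:

```python
import redshift_connector

# Placeholder connection details.
conn = redshift_connector.connect(
    host="analytics.example.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="admin",
    password="...",
)
cur = conn.cursor()

# Map an external schema onto the Glue Data Catalog; the IAM role
# must be allowed to read the S3 data lake.
cur.execute("""
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'marketdata'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
""")

# Describe the Parquet files sitting in S3 as an external table...
cur.execute("""
    CREATE EXTERNAL TABLE spectrum.trades (
        symbol     VARCHAR(12),
        price      DECIMAL(18, 6),
        traded_at  TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://example-bucket/trades/';
""")

# ...and query the data lake directly, with no separate load step.
cur.execute("""
    SELECT symbol, COUNT(*) AS trade_count
    FROM spectrum.trades
    GROUP BY symbol
    ORDER BY trade_count DESC
    LIMIT 10;
""")
print(cur.fetchall())
```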

The new architecture has allowed Nasdaq to grow its nightly record volume from 30 billion to 70 billion and beyond, while hitting its 90 percent data-load completion mark five hours earlier than before. That leaves Nasdaq ready to start analytics work as soon as one hour after the market closes.

Another category of Redshift functionality, its self-managing capabilities, improves cost-effectiveness by reducing administrative overhead, letting businesses across industries tap Redshift's potential without additional investment in staff.

"We understand that our clients prefer not to manually adjust their database in order to optimize its performance," he clarifies. "Hence, we have made significant investments in recent years towards autonomics technology, which enables the database to automatically fine-tune itself and provide the optimal balance between cost and performance."

One example of autonomics in action is the automatic optimization of how data is stored and distributed across the warehouse. Redshift can detect when data could be distributed more effectively for performance and automatically moves it to the appropriate nodes to speed up query execution. Placing the data well before a query runs reduces the amount of data shuffling needed while the query executes.

Database administrators used to designate distribution keys for data placement by hand; now it happens automatically. Customers can simply load their data and start working with it, says Gromoll, and Redshift will analyze the workload on its own and redistribute the data to strike the best balance of cost and performance.
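Concretely, tables created with DISTSTYLE AUTO, the default for new tables, opt in to this behavior, and a system view exposes the optimizer's recommendations. The table here is invented:

```python
import redshift_connector

# Placeholder connection details.
conn = redshift_connector.connect(
    host="analytics.example.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="admin",
    password="...",
)
cur = conn.cursor()

# DISTSTYLE AUTO lets Redshift choose, and later change, the
# distribution strategy as the table and workload evolve.
cur.execute("""
    CREATE TABLE events (
        user_id     BIGINT,
        event_type  VARCHAR(32),
        occurred_at TIMESTAMP
    )
    DISTSTYLE AUTO;
""")

# Automatic table optimization surfaces its suggested changes here.
cur.execute("SELECT * FROM svv_alter_table_recommendations;")
print(cur.fetchall())
```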

Gromoll sees even greater possibilities for autonomics in the years ahead, with database teams freed to focus on extracting meaningful insights from their data rather than managing the warehouse. Meanwhile, his team keeps pinpointing specific performance improvements that may look insignificant individually but add up to substantial cost savings when applied across vast numbers of queries. As data volumes grow, that constant search for better price performance can only benefit Amazon's customers.
