The herculean lift: pulling your data stack from servers to the cloud
Data is the world's new currency. However, in order to derive value from the Big Data paradigm, it is critical to design, develop, and deploy a cloud-based data architecture. Additionally, it is essential to move the on-premises data stack to a cloud-based ecosystem.
The wired.com article, Data is the new oil of the digital economy, describes how data has become the new secret resource for breakthrough profit opportunities. Raw data and crude oil are similar in that both must be processed (or analyzed) before they can be used. Crude oil is refined into diesel fuel, paraffin, and petrol (gasoline). In the same way, raw data is processed so that it can be analyzed and transformed into statistical information that drives strategic decision-making across all sectors of the organization.
"Moving data from servers to the cloud is not just a technical shift, it's a strategic decision that can transform an organization's agility, innovation, and bottom line. However, to achieve the full potential of cloud computing, organizations need to consider not just the technology, but also the people, processes, and governance involved in managing and securing their data in the cloud." - David Linthicum, Chief Cloud Strategy Officer at Deloitte Consulting LLP.
Many organizations still employ the legacy data architecture where all software applications interact directly with the data. The core function of these applications is to facilitate the movement of data between the database and the user interface. Even though the volumes of data generated by the legacy architecture form part of the Big Data paradigm, that data is still traditionally stored in silos on on-premises servers. And migrating data from one set of servers to another, or from one data center to another, is a massive hassle. [Figure: this process shown on a 7-step timeline.]
There are several negatives to this architecture, including duplicated data, data inefficiencies, increased costs due to the inability to scale up and down without having to purchase expensive hardware components, the risk of system failures and the need to implement costly failover mechanisms, and the inability to process petabytes or exabytes of data in near real-time due to the constraints of the legacy data stack.
Based on this information, it is logical to conclude that an alternative solution must be sought. In other words, the legacy server-based data architecture must move to a cloud-based ecosystem.
Moving from legacy data architecture to a cloud-based data stack
The fundamental reason for implementing a cloud-based real-time streaming data architecture is best highlighted by this quotation from Tim Berners-Lee, the inventor of the World Wide Web. Spoiler alert: it’s to answer questions asked by enterprise CEOs.
“Any enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, be able to run a company effectively, and especially to be able to respond to unexpected events. Most organizations are missing this ability to connect all the data together.”
The cloud has so many goodies to offer corporate, non-government, and government organizations alike. Because the cloud-service archetype has matured, cloud adoption is on the rise, with cloud-based technology spending expected to grow at over six times the rate of general IT spending from 2020 onwards.
Cloud services offer scalability, always-on data availability, and performance for the mobile workforce. Therefore, migrating legacy data architectures such as on-premises Apache Hadoop and Spark workloads remains a key priority for these organizations.
Big Data: the archetype and practice of storing massive amounts of data
“A Big data architecture is the overarching system used to ingest and process enormous amounts of data.”
Garrett Alley, Big Data Zone
The term “Big Data” refers to the voluminous, diverse, and ever-increasing datasets of raw data, both structured and unstructured, that the modern organization generates. This data must be extracted from its source, transformed, and loaded (ETL) into a data store where it is analyzed, providing information for decision-making purposes right across the organization’s operational scope.
Therefore, the Big Data architecture design pattern must pay close attention to the need to manage massive data volumes. Additionally, this architecture must also consider the fact that organizations generate different data types from multiple data sources, including structured and unstructured data types like event and system logs, flat files, and database tables.
One of the most productive and cost-efficient ways of managing such variable data is implementing the data pipeline architecture as a mechanism to get the data from its source into a data store. The data is ingested at its source, processed during its journey up the pipeline, and then deposited into a data store where it is available for analysis using statistical and advanced mathematical algorithms.
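The ingest, process, and deposit stages described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production design: the CSV content, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the data store.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for a flat-file data source.
RAW_CSV = """order_id,customer,amount
1001, alice ,19.99
1002,BOB,5.00
1003,carol,12.50
"""


def extract(text):
    """Ingest: yield one dict per row of the source file."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Process: normalise customer names and convert amounts to integer cents."""
    for row in rows:
        yield (
            int(row["order_id"]),
            row["customer"].strip().lower(),
            round(float(row["amount"]) * 100),
        )


def load(records, conn):
    """Deposit: write the cleaned records into the data store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Chain the three stages; generators keep only one row in memory at a time."""
    load(transform(extract(text)), conn)


conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the real store
run_pipeline(RAW_CSV, conn)
total_cents = conn.execute("SELECT SUM(amount_cents) FROM orders").fetchone()[0]
print(total_cents)
```

Because each stage is a generator feeding the next, the same shape carries over when the source is a large file or a stream rather than a short string.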
Data pipelines are not the only element of a successful Big Data architecture, but they are a vital one. In most instances, the Big Data lifecycle cannot succeed if the data pipelines are not properly designed, developed, and deployed. Therefore, all data architectures should start with the data pipelines as the vital link between the various data sources and the data’s destination in the data store.
Current challenges in Big Data
Before we consider a CTO’s guide to moving from a legacy server-based data architecture to a cloud-based Big Data stack, let’s look at some of the challenges facing the storage, ETL, and analysis of the voluminous data that makes up the Big Data paradigm.
Data costs money to move
Even though the month-to-month cost of running a cloud-based data stack can be substantially lower than maintaining an on-premises legacy data server stack, the cost of moving to a cloud-based architecture can be prohibitive. The database prep work alone (backing databases up, renaming servers, cataloging them, checking user accounts, capturing all the logins, etc.) can require weeks or even months of work. This inventory effort also takes time and energy away from other critical activities, so there’s an opportunity cost as well. Copying or migrating the data into new databases up in the cloud is usually not an issue (our friendly cloud vendors like Microsoft Azure and AWS love data ingress, just not data egress), but it still carries some nuances and required changes. Some of the stuff you’re used to with SQL Server running on dedicated servers simply does not exist in an Azure SQL Database instance or an Azure SQL Elastic Pool instance. No matter what you hear them say, keep in mind that some code changes are often required to get those databases to work properly in the cloud. Features that can be missing or behave differently include:
- User Defined Functions
- Database Mail
- SQL Agent Jobs
- Full-Text Search
These are the main ones... most other stuff works. But still, it can be a gotcha moment that must be accounted for in the time and money estimations and plans.
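One way to budget for these gotchas is to scan the existing SQL scripts during the inventory phase and flag the ones that touch the features listed above. The sketch below is a rough first pass under stated assumptions: the regular expressions are illustrative guesses at how each feature typically shows up in T-SQL (for example, a call to `sp_send_dbmail` for Database Mail), and a match only marks a script for manual review; it does not prove the script will fail in the cloud.

```python
import re

# Assumed patterns: T-SQL constructs commonly associated with each listed feature.
FEATURE_PATTERNS = {
    "User Defined Functions": re.compile(r"\bCREATE\s+FUNCTION\b", re.IGNORECASE),
    "Database Mail": re.compile(r"\bsp_send_dbmail\b", re.IGNORECASE),
    "SQL Agent Jobs": re.compile(r"\bsp_add_job\b", re.IGNORECASE),
    "Full-Text Search": re.compile(
        r"\bCREATE\s+FULLTEXT\s+(INDEX|CATALOG)\b", re.IGNORECASE
    ),
}


def flag_features(sql_text):
    """Return the sorted names of listed features that the script appears to use."""
    return sorted(
        name for name, pattern in FEATURE_PATTERNS.items() if pattern.search(sql_text)
    )


# Hypothetical script content used only to demonstrate the scan.
SAMPLE = """
CREATE FUNCTION dbo.NetPrice(@gross MONEY) RETURNS MONEY AS BEGIN RETURN @gross * 0.8 END;
EXEC msdb.dbo.sp_send_dbmail @recipients = 'ops@example.com', @subject = 'Nightly load done';
SELECT * FROM dbo.Orders;
"""

print(flag_features(SAMPLE))
```

Running the same function over every `.sql` file in a repository gives a quick count of scripts needing attention, which can feed directly into the time and money estimates.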
Data quality, scaling, and security
Data quality becomes a challenge when dealing with massive volumes of structured and unstructured raw data from disparate data sources. This challenge is not insurmountable. It can be addressed by creating a data pipeline from each data source to its destination data store and implementing an ETL process that accurately transforms the data, so that only quality data is loaded into the data store.
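As a rough illustration of what a quality gate in the transform step might look like, the sketch below checks each record against a small set of rules and routes failures to a reject list instead of the data store. The field names (`device_id`, `reading`, `recorded_at`) and the rules themselves are hypothetical; a real pipeline would derive them from the actual source schemas.

```python
from datetime import datetime

REQUIRED_FIELDS = ("device_id", "reading", "recorded_at")  # hypothetical schema


def validate(record):
    """Return a list of problems; an empty list means the record is loadable."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    if not problems:
        try:
            float(record["reading"])
        except (TypeError, ValueError):
            problems.append("reading is not numeric")
        try:
            datetime.fromisoformat(record["recorded_at"])
        except (TypeError, ValueError):
            problems.append("recorded_at is not an ISO 8601 timestamp")
    return problems


def split_by_quality(records):
    """Separate loadable records from rejected ones, keeping the reasons."""
    clean, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected


sample = [
    {"device_id": "a1", "reading": "21.5", "recorded_at": "2023-04-01T10:00:00"},
    {"device_id": "a2", "reading": "n/a", "recorded_at": "2023-04-01T10:05:00"},
    {"device_id": "", "reading": "19.0", "recorded_at": "2023-04-01T10:10:00"},
]
clean, rejected = split_by_quality(sample)
print(len(clean), len(rejected))
```

Keeping the rejected records together with their reasons, rather than silently dropping them, makes it possible to trace quality problems back to the source that produced them.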
The raison d’etre for, and value of, Big Data is in its volume. However, this can become a significant issue if the enterprise data architecture is not designed to quickly and efficiently scale up (and down). The best solution to this challenge is to deploy a cloud-based data stack on a platform like Amazon Web Services (AWS) that has scalability as one of its offerings and benefits.
Lastly, there has been a rapid increase in the number of cyberattacks since the start of 2020. This is primarily due to the ongoing COVID-19 pandemic. It can be challenging to protect the massive volumes of data generated and stored by an organization. Hackers are very interested in this data, so they might try to skim it, or fabricate data and add it to the data store. Consequently, it is imperative to ensure that this data is protected by the latest cybersecurity tools, methods, procedures, and processes.
Designing, deploying, and maintaining the cloud-based data architecture
From the perspective of an IT organization running legacy systems on local servers and loading data into an on-prem data warehouse via batch ETL processes, the overall IT spend curve is leaning away from what was once cost-effective. When that setup starts to offer diminishing returns, it is time to move the company's big data stack from on-prem to the cloud. Let's look at the building blocks of a cloud-based data architecture that are responsible for ETL and analysis, or for real-time data stream processing.
Every big data architecture design must start with the data sources. These can include structured and unstructured data from relational databases (RDBMS), static data from flat files and spreadsheets, as well as real-time data sources like telemetry data from IoT devices, system logs, and event logs.
The message broker or stream processor
The message broker is the element that ingests the data from its source, translates it into a standard message format, and streams it on to the next component in the data pipeline. Two of the most popular stream processing tools are Apache Kafka and Amazon Kinesis. These streaming brokers can handle very high volumes of message traffic (on the order of 1 GB per second) and persist messages while sustaining high performance, but they offer only limited support for task scheduling and data transformation. In other words, their primary function is to stream data from its source to its destination.
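To make the broker's role concrete, here is a toy, in-process stand-in written with Python's standard library. It is not Kafka or Kinesis and uses none of their APIs; the `InMemoryBroker` class, the envelope format, and the sample payloads are all assumptions made for illustration. It shows only the two ideas from the paragraph above: wrapping source-specific data in one standard message format, and passing it along in order.

```python
import json
import queue


def to_message(source, payload):
    """Wrap a source-specific payload in one standard envelope (assumed format)."""
    return json.dumps({"source": source, "payload": payload}, sort_keys=True)


class InMemoryBroker:
    """Stand-in for a single broker topic: producers publish, consumers poll in order."""

    def __init__(self):
        self._topic = queue.Queue()

    def publish(self, source, payload):
        self._topic.put(to_message(source, payload))

    def poll(self):
        """Return the next message as a dict, or None when the topic is empty."""
        try:
            return json.loads(self._topic.get_nowait())
        except queue.Empty:
            return None


broker = InMemoryBroker()
broker.publish("iot", {"device_id": "a1", "temp_c": 21.5})
broker.publish("syslog", {"host": "web-01", "line": "service restarted"})

received = []
while (message := broker.poll()) is not None:
    received.append(message)

print([m["source"] for m in received])
```

A real broker adds what this sketch leaves out: durable storage, partitioning across machines, and multiple independent consumers reading the same stream.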
The data store
The data is either persisted to a data store at this point, or it is processed by real-time analytics tools. Because this data is voluminous and can grow rapidly, it is essential to utilize a data store, such as an Amazon S3 data lake, that has the capacity to scale up quickly and cost-effectively.
Real-time and batch ETL tools
The data streamed from the message brokers needs to be aggregated and transformed before SQL-based tools can analyze it. Tools that can be slotted into the data pipeline architecture for this purpose include Spark Streaming and Apache Storm. Their function is to fetch events from the message brokers and to apply queries that aggregate and transform the data.
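The kind of aggregation these tools perform can be illustrated with a fixed-size (tumbling) time window, written here in plain Python rather than with the Spark Streaming or Storm APIs. The event tuples, the 60-second window, and the function name are assumptions for the sake of the example; real stream processors also handle late-arriving events and emit results incrementally, which this sketch does not.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size


def aggregate(events, window_seconds=WINDOW_SECONDS):
    """Group (timestamp, key, value) events into fixed windows and sum values per key."""
    totals = defaultdict(float)
    for timestamp, key, value in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        totals[(window_start, key)] += value
    return dict(totals)


# Hypothetical events: (seconds since start, event type, amount).
events = [
    (0, "checkout", 20.0),
    (30, "checkout", 5.0),
    (59, "refund", 2.5),
    (60, "checkout", 10.0),
    (125, "checkout", 1.0),
]
result = aggregate(events)
print(result)
```

The output is one row per window and key, which is the tabular shape that SQL-based analysis tools expect downstream.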
The last element of the data streaming pipeline architecture is the data analysis tools. As highlighted earlier, the data can be persisted to storage after the stream ingestion point, or it can be analyzed in real-time before being persisted to storage. Tools that are used at this point include Amazon Athena, Amazon Redshift with Kinesis, and Elasticsearch.
Amazon Athena is a serverless query engine and is typically used to analyze data after it has been saved in the S3 data lake. Amazon Redshift is a data warehouse, so a service such as Kinesis is used to stream the data into Redshift, where BI tools are used to analyze it. Lastly, Elasticsearch is a distributed search and analytics engine, so a stream processor like Kafka is typically used to stream the data into Elasticsearch before it can be analyzed.
Don’t forget about automation
Moving the data through the pipeline architecture on any sort of repeated basis requires the creation, testing, and deployment of automation scripts. (Beware: the famously one-time data hydration projects tend to evolve into repeated processes.) Any repeatable workload must be automated. [Our team at Product Perfect builds automation scripts for our clients and can help work through some of the complexities involved.] In that automation, it’s important to build for data sizes that scale. The typical mistake is to assume small data footprints; inevitably, those footprints grow and overload the hard drives that the automation scripts run on.
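One small defence against the growing-footprint problem is to have the automation script check free disk space before it starts and to stream files in chunks rather than loading them whole. The sketch below shows both ideas using only the standard library; the function names, the safety factor of 2, and the 1 MB chunk size are arbitrary assumptions to be tuned for a real workload.

```python
import shutil
import tempfile
from pathlib import Path


def has_headroom(path, incoming_bytes, safety_factor=2.0):
    """True if free space at `path` covers the incoming data times a safety factor."""
    free = shutil.disk_usage(path).free
    return free >= incoming_bytes * safety_factor


def copy_in_chunks(src, dst, chunk_size=1024 * 1024):
    """Stream the file in fixed-size chunks so memory use stays flat as files grow."""
    copied = 0
    with open(src, "rb") as reader, open(dst, "wb") as writer:
        while chunk := reader.read(chunk_size):
            writer.write(chunk)
            copied += len(chunk)
    return copied


def run_job(src, staging_dir):
    """Refuse to start when the staging disk is too full, then copy the file."""
    src = Path(src)
    size = src.stat().st_size
    if not has_headroom(staging_dir, size):
        raise RuntimeError(f"not enough free space in {staging_dir} for {size} bytes")
    return copy_in_chunks(src, Path(staging_dir) / src.name)


# Demonstration with a throwaway directory and a hypothetical export file.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "export.csv"
    source.write_bytes(b"id,value\n1,10\n2,20\n")
    staging = Path(workdir) / "staging"
    staging.mkdir()
    copied_bytes = run_job(source, staging)
    staged_bytes = (staging / "export.csv").read_bytes()

print(copied_bytes)
```

Failing fast with a clear error is usually preferable to a half-written file on a full disk, especially once a one-time load has quietly become a nightly job.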
The value of Big Data, when processed correctly, cannot and must not be underestimated. As highlighted above, the cost of moving the data stack from an on-premises server will more than likely be substantial. However, this cost must be considered against the cost savings, robustness, and intrinsic value of the information produced by this architecture.
Considerations and takeaways in a data migration include:
- A solid database [data warehouse, data lake, data store] inventory is key. Taking the time to get it right upfront will save you tons of time when the project kicks off and DBAs are running around shoveling data into the cloud.
- The tools and vendors are actually really helpful. This is a unique time in the industry, where data loads to the cloud can be banged out faster than you can believe using a few slick wizards. So don’t discredit the tools and do take the phone calls from the vendors.
- Leverage existing infrastructure. So much of what you’re already running could run in the cloud.
- Choose your cloud provider wisely and slowly. Select a cloud provider that meets your organization's specific needs, taking into account factors such as pricing, performance, availability, scalability, and security, as well as (perhaps most importantly) the development culture of your existing staff.
- Plan your migration as a first-class citizen project, complete with project managers, diagrams, and timelines. A comprehensive plan for migrating data to the cloud will include a robust timeline, a detailed budget, and a resource-leveled matrix.
- Include the CISO for data security and compliance from the very outset. Make this person explicitly responsible for ensuring your data is secure both during and after the migration. That’s their job. Make them do it.
- Validate the plan with an external third party. This is what so many consulting firms can do; it often costs tens of thousands, but can save hundreds of thousands or even millions in prevented losses or waste.