In my customer discussions, I have learned that organizations are increasingly choosing to run their big data workloads in the cloud. At the same time, many organizations that initially implemented big data on premises are looking to move those deployments to the cloud as well.
Amazon Elastic MapReduce (EMR) offers an alternative to on-premises deployment by providing a managed Hadoop framework that takes much of the work out of standing up a Hadoop cluster from scratch. This matters because 32% of business executives in a recent IDG survey cited a lack of technical depth as a major impediment to success with big data.
EMR allows businesses to process data across dynamically scalable Amazon EC2 instances and run distributed frameworks including Apache Spark, HBase, Presto and Flink. At the same time, it can make use of data in AWS data stores, including Amazon S3 and Amazon DynamoDB. Since the big data software is already installed and configured by Amazon, businesses can spend their time improving the quality of their big data analysis rather than on infrastructure and administrative tasks.
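To make this concrete, here is a minimal sketch of the kind of Spark job EMR runs with no extra setup, reading a dataset directly from Amazon S3. The bucket, paths and column name are hypothetical placeholders, not anything prescribed by AWS.

```python
# Minimal PySpark sketch: runs on an EMR cluster where Spark is preinstalled.
# Bucket, paths, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-analysis-example").getOrCreate()

# EMR clusters read S3 natively via the s3:// scheme (EMRFS).
events = spark.read.json("s3://example-data-lake/events/")

# A simple aggregation: count events per type, written back to S3.
counts = events.groupBy("event_type").count()
counts.write.parquet("s3://example-data-lake/output/event_counts/")

spark.stop()
```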
However, concerns about relying on Amazon EMR’s native security alone keep many users from taking their big data to the cloud, and, just as importantly, from making use of Amazon’s out-of-the-box solutions for log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation and bioinformatics.
According to the same IDG survey, significant concerns surround public cloud big data in the areas of compliance, data governance and data security. These concerns persist unless sensitive data is permanently deidentified, and many organizations are in fact doing exactly that.
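As a sketch of what permanent deidentification can look like in practice, the snippet below replaces a PII column with a salted one-way hash before the data ever leaves the premises. The file, column name and salt handling are illustrative assumptions, not a complete deidentification program.

```python
# Illustrative sketch: irreversibly deidentify a PII column before upload.
# File, column name, and salt source are hypothetical; a real program would
# also consider quasi-identifiers (dates of birth, ZIP codes, etc.).
import csv
import hashlib
import os

SALT = os.environ["DEID_SALT"].encode()  # secret salt kept outside the dataset

def deidentify(value: str) -> str:
    """Return a salted SHA-256 digest; the original value is unrecoverable."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()

with open("patients.csv", newline="") as src, \
     open("patients_deidentified.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["ssn"] = deidentify(row["ssn"])  # PII column becomes a token
        writer.writerow(row)
```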
This approach, of course, does not work for many use cases, especially those where the data feeds a predictive model. So wouldn’t it be great if you could take advantage of Hadoop without increasing the risk of an unwanted data release?
If you could eliminate the data security risk, you could follow what is becoming a well-proven path of data analysis. In the last few years, there has been a dramatic increase in the number of organizations running their big data workloads in the public cloud, driven largely by the high total cost of ownership of on-premises Hadoop clusters.
To my surprise, Amazon claims that AWS EMR now runs more workloads than both Cloudera and Hortonworks. By leveraging Amazon S3 as a central data hub for big data workloads, users can process big data sets at scale.
The problem is that big data implementations typically contain significant amounts of sensitive PII or PHI. To work with this data in the public cloud, or even on premises, it is essential that it be protected. What is needed is an architecture in which EMR can consume protected data from S3. In this model, we can also treat EMR as a transient compute platform that is spun up and spun down as analytical results are required. The advantage is that you have a massively powerful compute platform on an as-needed basis. What is missing, as we have been saying, is the ability to protect data flows as they move in and out of EMR.
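As a sketch of that transient pattern, the boto3 call below spins up a cluster that runs a single Spark step and terminates itself when the step finishes. The release label, default EMR roles, instance types and S3 paths are assumptions for illustration.

```python
# Sketch of a transient EMR cluster: spun up for one job, then torn down.
# Release label, roles, instance types, and S3 paths are illustrative.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="transient-analytics-cluster",
    ReleaseLabel="emr-6.15.0",          # assumed release; pick a current one
    Applications=[{"Name": "Spark"}],   # software preinstalled by EMR
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate after the last step
    },
    Steps=[{
        "Name": "run-analysis",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/analysis.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```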
It always makes business sense to start with the native protections provided within EMR and S3. But adding to them the ability to directly protect the data flowing in and out of EMR provides the level of protection that regulated industries require. In other words, if hackers break into Amazon, they do not get your valuable sensitive data.
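Those native EMR and S3 protections can be enabled programmatically. Below is a minimal sketch, assuming an existing KMS key, that creates an EMR security configuration turning on encryption at rest for S3 data (via EMRFS) and for cluster local disks; the configuration name and key ARN are placeholders.

```python
# Sketch: enable EMR's native at-rest encryption via a security configuration.
# The KMS key ARN and configuration name are hypothetical placeholders.
import json
import boto3

emr = boto3.client("emr", region_name="us-east-1")

encryption_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": False,  # needs a TLS cert config if True
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
            },
        },
    }
}

emr.create_security_configuration(
    Name="emrfs-kms-at-rest",
    SecurityConfiguration=json.dumps(encryption_config),
)
# Pass SecurityConfiguration="emrfs-kms-at-rest" to run_job_flow to apply it.
```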
For those in regulated industries, combining those native controls with direct protection of the data flow provides the kind of protection needed to securely lift and shift your big data to the cloud. And the good news is that there are now multiple providers that can help you secure your data in Amazon. So what is stopping you?
Article written by Myles Suer