IBM Launches Hub for Data Scientists to Analyze Big Data using Apache Spark

On Tuesday, IBM announced what it described as the first cloud-based development environment for near real-time, high-performance analytics, giving data scientists the ability to access and ingest data and deliver insight-driven models to developers.

Available on the IBM Cloud Bluemix platform, the Data Science Experience provides 250 curated data sets, open source tools and a collaborative workspace to help data scientists uncover and share meaningful insights with developers.

Building on its $300 million investment in developing Apache Spark as a type of "analytics operating system," IBM created the Data Science Experience to extend the speed and agility of Spark to more than two million members of the R community through new contributions to SparkR, SparkSQL and Apache SparkML. As a result, data scientists who work in R should have faster access to more data, and in turn, more insights delivered from the IBM Cloud.

The Data Science Experience's open environment allows data scientists to accelerate and simplify data ingestion, curation and analysis by bringing together content, data, models and open source resources from IBM and others, including H2O, RStudio and Jupyter Notebooks on Apache Spark, in a single security-rich managed environment.

"With Apache Spark, we see an opportunity to significantly transform the role of the data scientist by providing access to curated data sets, open source tools and a collaborative platform to accelerate innovation," said Bob Picciano, Senior Vice President, IBM Analytics.

Case studies

IBM is already working with organizations to use data science applications built on Apache Spark. For example, using IBM Spark, IBM Bluemix and mobile technologies, the Bernhardt Furniture IT team designed a virtual showroom app for iPad devices that gives the sales team immediate access to the latest product information. Real-time analysis of traffic patterns and product trends now allows Bernhardt to make rapid adjustments to product placement, pricing and availability status.

IBM, NASA and the SETI Institute are working together to analyze more than six terabytes of complex deep space radio signals to hunt for patterns that might identify the presence of intelligent extraterrestrial life. With IBM Analytics on Apache Spark, SETI has embarked on a new Stellar Pair Eavesdropping campaign, which enables the organization to look for potential communications between planets that might be orbiting in double star systems.

IBM continues to collaborate with data science organizations including Galvanize, Lightbend and RStudio to promote an integrated and unified data science ecosystem. Additionally, IBM is joining the R Consortium to help accelerate data science's readiness for the enterprise.

In the growing analytics ecosystem, IBM has contributed to related projects including Apache Toree, EclairJS, Apache Quarks, Apache Mesos and Apache Tachyon (now called Alluxio), and has made major contributions to the Apache Spark sub-projects SparkSQL, SparkR, MLlib and PySpark.

In addition, Spark is built into the core of IBM platforms, including Watson, Commerce, Analytics, Systems and Cloud. In 2015, IBM also open-sourced its breakthrough SystemML machine learning technology to advance Spark's machine learning capabilities.

"With Data Science, the major roadblock is having access to large data sets and having the ability to work with so much data. With [Tuesday's] announcement, clients can have both," said Picciano.

Article published by Anna Hill
Image credit by IBM