Building High-Performance Data Science Teams

Analytics has become a top priority for organizations looking to extract business value from their data, whether structured, semi-structured or unstructured. Big data is still relatively new to many organizations, and its significance for business processes and outcomes changes every day.

Most people consider business intelligence capabilities an important starting point, thinking in terms of structured data analysis and reporting. But to handle advanced analytics, unstructured data, huge data volumes and predictive models, organizations need a much broader skill set: massively parallel processing (MPP), cloud computing, social media data collection and analysis, advanced statistical modeling, the Hadoop framework and agile software development.

In the U.S., the hiring scale is 73 for jobs that require big data skills, with 13 candidates per job opening (as of November 16, 2015). The higher the hiring scale score, the harder it is for employers to find the right applicants for open positions. Countries like Brazil, on the other hand, have only just begun to catch the big data wave, going from almost no jobs posted on LinkedIn, for instance, to 117 in the last month, despite the country's economic crisis.

Given the competition for talent in the data science arena, organizations should therefore consider alternatives to hiring people who are already trained. For more specific job titles, such as sales manager, demand has increased 517 percent, which illustrates the difficulty of securing an excellent, experienced professional.

BI, big data or machine learning?

People like to say “advanced analytics” because it may sound cool, but the fact of the matter is that most decisions – be it business or operational – are driven by much more basic analyses, such as reporting of numbers on a consistent basis or displaying data in dashboards.

In fact, most of the business processes that we inform with data are really BI-centric. Reporting and dashboards are the workhorses of what most companies do: 80 percent of their problems are perfectly suited to, and well addressed by, those tools.

But advanced analytics is the exciting technology that goes two or three steps beyond dashboards, actually using data to drive computations that inform more complex decisions: predictive analytics, decision modeling and, ultimately, services delivered by computers. Efficiency and quality improvements depend on data management, data analysis and data governance.

So the key question remains: Do I need to deal with big data and machine learning or just BI?

The answer: Any provider organization hoping to make the most of its data should keep both BI and analytics programs in top shape, but it needs to recognize the differences in detail.

Understand the skills required before building a team

Many data analysts do not find agile methodologies applicable to what they do. Unlike software development, they see analytic workflows as intrinsically uncertain and hard to plan as sprints (standups, sprint planning, retrospectives and so on). Traditionally, data visualization (reporting) was developed with a waterfall approach, in which a single expert typically provides the requirements in the earliest phases of the process and verifies the final result near the end.

In big data, multiple experts and the IT team must build the solution together in timely, well-defined iterative tasks, which demands much more engagement with and availability for the project. Agile is all about incremental delivery. What does this mean in practice? In my view, each increment should include not only results (the analysis) and insights (what they imply) but also the code or function that generated them, so that the results are replicable. In fact, the agile principle grew up to deal with precisely this kind of uncertainty about user requirements and a business environment that imposes frequent changes on the project.

For a BI project, on the other hand, an organization may need one person for extract, transform and load (ETL) and another for the semantic layer and dashboard visualization. For big data, and even more so for machine learning, several other roles are involved, such as an experienced software developer (Java, Python, R) and someone with hands-on implementation experience in the Hadoop ecosystem, plus APIs that may have to be developed from scratch. Each team member needs to collaborate within the team and with users; this is where the agile methodology comes into play.

Another point is more technical. What happens if a job processes very large data sets without proper software construction? The Apache Spark driver can run out of memory (an out-of-memory, or OOM, error) and the program will crash; funneling all the data through one process also takes a long time. A better approach is to look at the DataFrame API and note that the computation can run entirely on the executors, in parallel, which makes it scalable and efficient.
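The same principle can be sketched without a Spark cluster: a streaming reduction processes records one at a time in constant memory, while the naive version materializes the whole data set first, which is the rough analogue of calling `collect()` on the driver instead of letting executors aggregate via DataFrame operations such as `groupBy(...).agg(...)`. The function names below are illustrative, not part of any Spark API.

```python
from typing import Iterable, Iterator


def generate_records(n: int) -> Iterator[float]:
    """Simulate a large data source, yielding one record at a time."""
    for i in range(n):
        yield float(i)


def mean_naive(records: Iterable[float]) -> float:
    """Materializes every record in memory first -- the analogue of
    collect()-ing a large DataFrame to the Spark driver."""
    data = list(records)  # O(n) memory: risks an OOM error for large n
    return sum(data) / len(data)


def mean_streaming(records: Iterable[float]) -> float:
    """Single-pass reduction in O(1) memory -- the analogue of letting
    executors aggregate in parallel and return only the small result."""
    total, count = 0.0, 0
    for r in records:
        total += r
        count += 1
    return total / count


print(mean_streaming(generate_records(1_000_000)))  # 499999.5
```

Both functions compute the same answer; the difference is that the streaming version never holds more than one record at a time, which is what a well-constructed Spark job achieves at cluster scale.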

Analytics problems today require technological, quantitative and decision-making skills. Each of these demands a great deal of specific knowledge and experience; no single person can cover them all adequately. There are many potential technologies and architectures to define and build, several decision models based on psychology, behavioral science and practical results, and plenty of statistical algorithms besides neural networks. How to select and properly use these elements is far from trivial, and we still have no definitive script for each scenario being analyzed. Moreover, all of these elements are changing so fast that it is nearly impossible to keep the organization and its team up to date with the best method, technology or approach.

Do I really need to build a complete data science team from scratch?

The company Syngenta developed an award-winning suite of analytics tools by tapping into expertise outside the organization — including talent available through open-innovation platforms. How does a company operating outside the major technology talent centers gain access to the most innovative data scientists that money can buy? Assuming you can’t recruit the right data analysts to join your team full time, how do you tap into contractors with the knowledge and creativity you need outside your technical core?

Leveraging the potential of outside experts requires close cooperation from in-house employees, who need to feel that it is good for the business and does not threaten their jobs. Cooperation from staff is also essential for framing problems and evaluating options. Syngenta turned to several online crowdsourcing platforms to find talent that could help it increase its R&D efficiency, making the most of those platforms while learning how to leverage advanced mathematics to develop better varieties of plants.


The extensive use of key performance indicators (KPIs) and dashboards has increased the adoption of business intelligence platforms across industries. Current technologies and competences allow companies to execute data analytics across disparate systems running databases, data warehouses and structured or unstructured data sets, without impacting day-to-day operations or access to data. Some organizations at a more mature stage are already applying machine learning to sophisticated diagnostics or predictive analytics.

While technology is a critical aspect, it is equally important to develop strong data science capabilities within the organization, meaning deep knowledge of statistics and of the specific domain of the problem to be solved. Data scientists are in high demand nowadays, and the role is not an easy one to fill. The same goes for Hadoop ecosystem architects and developers.

Therefore, once top executives decide to implement an advanced analytics program, they should make the BI program as robust as possible from the beginning. And do not assume that one magical data scientist suffices; he or she is just one component of the program. The organization should extend its knowledge by developing and onboarding several roles, and consider bringing in additional experts from outside the company to accelerate the program and make it more flexible and dynamic. This discussion is critical to making a big data endeavor a successful one.

Article written by Werther Krause