As the Associate Director of the Master of Science in Analytics program and an assistant professor in the School of Computational Science and Engineering at Georgia Tech, I have learned many helpful lessons over the years from working with tech companies, including juggernauts such as Google, eBay, Symantec and Intel.
Here are my top 10 lessons for anyone considering entering analytics and pursuing a career in this ever-growing, increasingly essential field:
Most companies are looking for data scientists, and that is good news: it means many job opportunities. According to Gartner, “the data scientist role is critical for organizations looking to extract insight from information assets for ‘big data’ initiatives and requires a broad combination of skills that may be fulfilled better as a team.”
I emphasize the “broad combination” aspect here, as breadth of knowledge is important. In today’s world of data, in which millions of emails are sent every second and households consume hundreds of megabytes of data per day, we must think (a lot) about the challenges that come with that, including storage, complex system design, scalability of algorithms, visualization techniques, interaction techniques, statistical tests, etc.
Enabling us to address these challenges are the building blocks of analytics — collection, cleaning, integration, analysis, visualization, presentation and dissemination. With these building blocks as a foundation, you will find that data types inform visualization design; data informs choice of algorithms; visualization informs data cleaning for dirty data; and visualization informs algorithm design, for when the user finds results that don’t make sense.
An essential data analytics skill is the ability to decompose a problem into smaller pieces and identify which ones already have well-known, effective solutions, so as to focus our energy on the remaining pieces that require innovation.
A good book on this topic is Foster Provost and Tom Fawcett’s "Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking," which highlights key data science concepts that I think all data scientists should know, such as classification, regression, similarity matching, clustering and co-occurrence grouping. By understanding these core concepts well, we can generalize to related techniques that may come about in the future.
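To make one of those core concepts concrete, here is a minimal sketch of similarity matching via a nearest-neighbor classifier, written in plain Python. The data points and labels are made up for illustration; real work would use a library and proper evaluation, but the underlying idea is exactly this.

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training point closest to the query.

    train: list of (features, label) pairs, features being numeric tuples.
    """
    def dist(a, b):
        # Euclidean distance between two feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(train, key=lambda item: dist(item[0], query))[1]

# Hypothetical toy data: two clusters of 2-D points.
train = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
         ((8.0, 9.0), "large"), ((9.0, 8.5), "large")]

print(nearest_neighbor(train, (1.1, 0.9)))  # closest to the "small" cluster
```

The same distance-based idea underlies clustering (group points that are near each other) and recommendation (find users similar to this one), which is why similarity matching generalizes so well.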
Data are dirty. Always have been, and always will be. You will likely spend the majority of your time cleaning data. It’s important work!
Just how dirty are real data? Think of all the different ways a date can be written — Jan 19, 2016; 1/19/16; 2016-01-19; 19/1/16. Think about duplicates, empty rows, different kinds of abbreviations, typos, missing values, trailing spaces, incomplete cells, synonyms, bad formatting — the list goes on.
Data scientists can expect to spend 80 percent of their time on data preparation, which can lead people to feel like data janitors. There is a silver lining, however, with tools such as OpenRefine (formerly Google Refine), which is a free, open source tool for working with dirty data.
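Even without a dedicated tool, a little scripting goes a long way. Here is a minimal sketch in Python that normalizes the messy date formats mentioned above into ISO 8601; the list of formats is an assumption you would extend for your own data, and genuinely ambiguous values (is 1/2/16 January 2 or February 1?) still need a human decision.

```python
from datetime import datetime

def normalize_date(raw):
    """Try several common date formats; return ISO 8601 (YYYY-MM-DD) or None."""
    formats = [
        "%b %d, %Y",  # Jan 19, 2016
        "%m/%d/%y",   # 1/19/16 (US order; tried before day-first, an assumption)
        "%Y-%m-%d",   # 2016-01-19
        "%d/%m/%y",   # 19/1/16 (day-first)
    ]
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual review

for raw in ["Jan 19, 2016", "1/19/16", "2016-01-19", "19/1/16"]:
    print(raw, "->", normalize_date(raw))
```

Notice that all four inputs collapse to the same canonical value, which is exactly what deduplication and joining downstream depend on.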
Python is one of the top programming languages at tech firms like Google (the other two are Java and C++). It’s easy to write, read, run and debug, and it works well with others — it’s a great “glue” language, meaning your Python program or script can easily call code or libraries written in other languages. Popular libraries often support Python first.
Certainly, R is also very popular and has strong community support. And if you are thinking about writing production code (e.g., where speed is very important), then C, C++ or Java will be much better options.
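To illustrate the “glue” point, here is a minimal sketch using the standard-library `ctypes` module to call a compiled C function (the system math library’s `sqrt`) directly from Python. The library lookup is platform-dependent, so treat this as a demonstration of the idea rather than portable production code.

```python
import ctypes
import ctypes.util

# Locate the system C math library (name and path vary by platform).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sqrt(double).
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))  # calls compiled C code, no wrapper required
```

In practice you would more often reach for a library such as NumPy that does this kind of bridging for you, but the principle — Python orchestrating fast compiled code — is the same.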
Even if job descriptions don’t mention them, you need to know SQL and algorithms because: 1) many datasets are stored in databases, and 2) you need to know whether an algorithm can scale to large amounts of data.
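The SQL you need day to day is often this simple — filter, aggregate, group. A minimal sketch using Python’s built-in `sqlite3` module, with a hypothetical `sales` table made up for illustration:

```python
import sqlite3

# In-memory database with a toy sales table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0)],
)

# Total sales per region: the bread-and-butter GROUP BY aggregation.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(rows)  # [('east', 150.0), ('west', 75.0)]
```

The scalability point matters here too: the database can push this aggregation down to indexes and disk-aware algorithms, which is why “pull everything into memory and loop” is often the wrong answer on large data.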
In visualization, as with data-driven documents (the idea behind tools like D3), seeing is believing. This is a huge competitive edge.
Given that “big data” is so common and machines and disks die — according to Google, about 3 percent of 100,000 hard drives fail within the first three months — companies expect you to know the basics of big data technologies, such as Hadoop and Spark.
Hadoop and Spark are open-source software for reliable, scalable, distributed computing. Fortune 500 companies and many research groups use them, and they are inexpensive to set up. Familiarity with them is becoming an essential skill, like SQL.
The Spark project started in 2009 at UC Berkeley’s AMP lab and was open sourced in 2010. You might ask why you should consider a new programming model. MapReduce greatly simplified big data analysis, but as soon as it became popular, users wanted more complex, multi-stage applications (e.g., iterative graph algorithms and machine learning) and more interactive ad-hoc queries. Both require faster data sharing across parallel jobs. Some Spark programs can run more than 10 times as fast as their Hadoop MapReduce counterparts.
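To see what the MapReduce model actually does, here is the classic word-count example sketched in plain Python on two made-up documents. In Hadoop or Spark the map and reduce steps run in parallel across machines and the shuffle is handled by the framework; here everything runs in one process, but the three phases are the same.

```python
from collections import defaultdict
from itertools import chain

docs = ["big data is big", "spark speeds up big data"]  # toy corpus

# Map: each document emits (word, 1) pairs.
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Shuffle: group pairs by key (the framework does this for you in Hadoop/Spark).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the values for each key.
counts = {word: sum(vals) for word, vals in groups.items()}

print(counts["big"])  # 3
```

Spark’s speedup over MapReduce comes largely from keeping intermediate results like `groups` in memory across stages, instead of writing them to disk between every map and reduce.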
Be cautiously optimistic. And be careful of hype. Every day, you will hear about new technologies in the popular press. We should learn about them, but at the same time we should dig deeper, where possible, to understand whether they are indeed promising, and why.
If people don’t understand your approach, they won’t appreciate it. So practice your networking and presentation skills! By doing so, you can influence others and share the great work that you have done.