Joel Sokol is founding Director of Georgia Tech's interdisciplinary Master of Science in Analytics degree and an Associate Professor in the Stewart School of Industrial & Systems Engineering (ISyE). Dr. Sokol's primary research interests are in sports analytics and applied operations research. He has worked with teams or leagues in all three major American sports and has received Georgia Tech's highest awards for teaching.
icrunchdata speaks with education leaders about their experience, programs, and latest initiatives. Today, Dr. Sokol comments on his industry-leading LRMC method for predictive modeling of the NCAA basketball tournament, Georgia Tech's MS Analytics degree, and the future of analytics.
When I was a kid I was a big sports fan. In college I didn't know what I wanted to major in, so I tried a lot of different things, and I didn't really like them much – at least not enough to do them as a career. Then one of my roommates said "Hey, I'm taking this class that's very mathy, but it's more applied, and I think you'll like it."
I took it with him – it was an optimization course in operations research – and I loved it. I decided that was what I wanted to do. It seemed like every week we were learning something where I could say, "Hey, I can apply this to analyzing baseball" or "This would help with my fantasy football team."
So from the time I started learning analytics, I was thinking about how it could be applied to sports. When I was in grad school, I wrote a paper on optimizing baseball teams, particularly batting orders. So it all started from there.
Now I'm leading the team of people that puts out the LRMC rankings every year. I've done some consulting for sports teams in three different leagues: MLB, the NFL, and the NBA. It's a lot of fun doing it.
I've worked with five baseball teams, to varying degrees, and I've started working with a football team and a basketball team.
I've had a handful of students looking at ways to improve the model, and they've come up with some very smart improvements over the years – varying home court, removing schedule bias, transforming the function, etc. – but it turns out that as smart as they are and as correct as they are, there's really not much difference in the results. We've discovered that there's sort of a limit.
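For readers unfamiliar with how a Markov-chain ranking works, here is a toy sketch of that half of an LRMC-style model. This is not the actual LRMC implementation: the logistic weight and its scaling constant are invented stand-ins for the regression-fitted head-to-head probabilities, purely for illustration. The idea is that each team "votes" for the teams it lost to, with bigger margins earning stronger votes, and the chain's stationary distribution gives the rankings.

```python
import numpy as np

def rank(n_teams, games, steps=1000):
    """Toy Markov-chain ranking.

    games: list of (winner_idx, loser_idx, margin_of_victory).
    Returns stationary weights; higher weight = stronger team.
    """
    P = np.zeros((n_teams, n_teams))
    counts = np.zeros(n_teams)
    for w, l, margin in games:
        # Hypothetical logistic weight; the real model fits this
        # from data (and adjusts for home court, etc.).
        p = 1.0 / (1.0 + np.exp(-margin / 10.0))
        P[l, w] += p       # loser sends weight p to the winner
        P[l, l] += 1 - p   # and keeps the rest
        P[w, l] += 1 - p   # winner concedes a little back
        P[w, w] += p
        counts[w] += 1
        counts[l] += 1
    P = P / counts[:, None]          # normalize each row over its games
    pi = np.ones(n_teams) / n_teams  # start from a uniform distribution
    for _ in range(steps):
        pi = pi @ P                  # power iteration to the steady state
    return pi
```

A quick sanity check: with three teams where team 0 beats team 1, team 1 beats team 2, and team 0 beats team 2, the stationary weights come out in that order.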
You can't predict much better than 75 percent of the games correctly, and the reason seems to be that there's such a large random component in college basketball, especially compared to how well mathematical models and experts can pinpoint how good teams are. If you look at the Las Vegas spread – they're the real experts – about a third of their spreads are off from the true outcome by 11 or more points. People's estimates of how good teams really are, relative to each other, are much more precise than plus or minus 11 points. So there's a big random component that's hard to get past.
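To get a feel for why accuracy tops out around 75 percent, one can back out the implied noise level from that spread statistic. The sketch below assumes, purely for illustration, that spread errors are normally distributed with mean zero; the 7-point example spread is also an assumption, not a figure from the interview.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def solve_sigma(threshold=11.0, tail_prob=1.0 / 3.0):
    """Find sigma such that P(|error| >= threshold) = tail_prob
    when error ~ Normal(0, sigma), via bisection."""
    lo, hi = 1.0, 50.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        p = 2.0 * (1.0 - phi(threshold / mid))  # two-sided tail probability
        if p < tail_prob:
            lo = mid  # sigma too small: tail too thin
        else:
            hi = mid
    return 0.5 * (lo + hi)

sigma = solve_sigma()            # implied noise, roughly 11 points
favorite_win = phi(7.0 / sigma)  # chance a (hypothetical) 7-point favorite wins
```

Under these assumptions the implied standard deviation is a bit over 11 points, so even a team favored by a full week's worth of preparation – say 7 points – still loses roughly a quarter of the time, which is consistent with a ceiling near 75 percent.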
There are other types of day-to-day human variability that also have a significant effect, but right now we don't have access to either causal or predictive data.
So in that sense, I don't think there are going to be many improvements to any of the leading models, including our LRMC, until we’re able to capture more of these human factors.
I'd say this LRMC model is probably one. We first put it together after the 2002-2003 season. Georgia Tech had played in a holiday tournament against Tennessee; they were ahead by a point with just a couple of seconds left, and then a Tennessee player hit a half-court shot to win the game. At the end of the regular season, some of the experts thought that if Georgia Tech had won one more game, they would have had a shot at being in the NCAA tournament. That made me think back to the Tennessee game – if some guy hits a last-second half-court shot, does it really say that Georgia Tech is a different team than if he'd missed?
After that year I started putting the model together. I knocked on the door of my statistics-expert colleague, Paul Kvam, who is no longer at Tech. We put this model together and tested it on the fly for the first time the next year, in 2003-2004.
Before the NCAA tournament started that year our model was basically the only one that was predicting Georgia Tech going to the Final Four. It was a little bit worrisome, because we were trying to make the case that we had a completely unbiased mathematical model, and here we were at Georgia Tech as the only ones picking Georgia Tech to go to the Final Four. And of course, Tech did make it to the Final Four that year – they made us look good. Them, and some luck, probably. Tech played lots of close games in that run, and every round we kept saying Tech was likely to win. That helped put LRMC on the map.
A few years after that our model correctly predicted the NCAA tournament’s Final Four, the finalists, and the winner – even the NIT winner. Again, there was a significant luck component – because there’s so much randomness I don't expect it to ever happen again – but it really helped in terms of getting us attention and getting people to pay attention to it.
Yes. One of our Master of Science in Analytics students is working for a sports startup, and a second interned for an NFL team. Another of our M.S. Analytics students has his own sports analytics startup, and yet another seems to be moving in that direction. At the undergraduate level, I have an excellent research student who is going to be an analytics intern with a Major League Baseball team this summer.
Overall, it’s hard to get a job in sports analytics. So many people with the combination of technical skills and sports interest apply that the competition is fierce. Many of the people who get sports analytics jobs have already done sports analytics well as a hobby – they’ve written about it on blogs and websites, gotten noticed, and been hired. Showing off your own good, original work is the most sure-fire way to get noticed and hired, but your work has to be genuinely good to stand out, because there are so many people blogging.
There have been several changes over the past few years. For example, machine learning has really taken off, there has been a proliferation of good analytics software and analytics-friendly platforms, etc. But to me, the biggest change has been that analytics has really become a household word. It used to be that when I told people I do analytics, I’d have to explain what it is. Now, everyone knows what it is and wants to find out exactly what I’m working on.
I see a few areas where analytics is really going to grow. First, I think, will be prescriptive analytics. Initially – and this is just a generalization – descriptive analytics was the hot area, and the cutting-edge companies were the ones who had figured out how to use good predictive analytics. Now, lots of companies have gotten into predictive analytics, and the cutting-edge companies are the ones who are also incorporating good prescriptive analytics. Five or 10 years from now, I think lots of companies will be incorporating prescriptive analytics as well.
Another growth area in analytics is, I think, going to be in trying to outthink analytics systems. A good parallel might be an internet search. First, Google got really good at figuring out which were the best sites to send you to. Then, people started getting very good at gaming the system – figuring out ways to trick Google’s algorithm into sending you to their website, even if it wasn’t the best.
I think there’s a coming parallel growth in analytics. For example, now that there are good algorithms for determining whether an individual is a good loan risk, it won’t be long before someone works out the best way to fool those algorithms. When self-driving cars become more common, it probably won’t take that long for one driving algorithm to figure out how to exploit another driving algorithm – “Get there faster when you’re stuck in a jam! Our self-driving algorithm gets Company X’s cars to let our car merge ahead in traffic 92 percent of the time!” Analytics to trick others’ analytics are coming soon, if they’re not here already.
And finally, I think that analytics will grow in the area of human behavior. Most of the current analytics algorithms don’t do a good job of accounting for human behavior and human reaction. I think the ability to do that can and will be a big differentiator, especially as we continue to advance both in analytics techniques and in our knowledge of how humans think and act.