Advanced Machine Learning by Hilary Mason


Summary: A very good overview, but too shallow on details

I have already seen presentations by Hilary (e.g. at Strata), and I think they were very good as conference material. At a conference you have limited time and you obviously want to show as much as possible. Lectures and workshops, on the other hand, can take as much time as the topic needs. That’s why I expected a much deeper analysis of the topics covered in this particular video. I expected that by watching it I would learn all the topics and be able to apply them right after finishing. This was not quite the case.

Let’s talk about the bright side first. I must admit that for people who are new to data analysis this video is a really good overview of some of the tools available on the market. Lots of algorithms and applications are discussed here. You will learn about various metrics, decision trees, k-nearest neighbor basics, dimensionality reduction, principal component analysis, simhash, Hamming distance, Bloom filters, MapReduce and Hadoop. This is definitely a benefit for people who are not yet familiar with these topics.

However, there is another side of the coin. The lecture is quite shallow. Basically, you will be presented with a few basic examples for each topic, based on very simple data. Hilary runs these basic analyses the quick and dirty way, using the CLI and Python, and does it quite efficiently. This will certainly be useful for computer geeks who are familiar with the CLI (mostly Linux users and advanced OS X users) and with Python itself. I’d question whether Python is the best tool to visualize the results in the first place, but it’s just a weapon of choice. It could be done in R as well, which, in my opinion, is far better suited to data analysis than Python is. As for Windows users, I am pretty sure that they will not benefit from this video, as they mostly use Excel for brief data analysis, have no idea what a CLI is, and when they hear ‘Python’ they think ‘ZOO’. Advanced Windows developers, please excuse my irony.

The last thing I am not happy with is the explanation of the results. Hilary simply assumes that the values produced during calculations are self-explanatory. They are not. I think that for people who see a decision tree for the first time, a detailed explanation of how to read the tree would be very helpful. The same applies to the other topics presented during the lecture.
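As a rough illustration of the kind of explanation that would help, here is a minimal sketch that is not taken from the video. It assumes a tiny hand-written tree with made-up feature names and thresholds (the tuple layout and the `explain` helper are hypothetical, chosen only for this example). Reading a decision tree means starting at the root, answering one test per node, and following the matching branch until you reach a leaf.

```python
# Toy decision tree written by hand; names and thresholds are made up
# for illustration and do not come from the class material.
# Internal node: (feature, threshold, subtree_if_true, subtree_if_false)
# Leaf: a plain string holding the predicted label.
TREE = ("petal_length", 2.5,
        "setosa",
        ("petal_width", 1.8, "versicolor", "virginica"))


def explain(tree, sample):
    """Walk from the root to a leaf, recording every test applied on the way."""
    steps = []
    node = tree
    while not isinstance(node, str):
        feature, threshold, if_true, if_false = node
        passed = sample[feature] <= threshold
        answer = "yes" if passed else "no"
        steps.append(f"{feature} = {sample[feature]} <= {threshold}? {answer}")
        node = if_true if passed else if_false
    return node, steps


label, steps = explain(TREE, {"petal_length": 4.7, "petal_width": 1.4})
for step in steps:
    print(step)
print("prediction:", label)
```

Each printed line corresponds to one node on the path, so the output reads the same way you would trace the drawn tree from top to bottom.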

I’d suggest this video to people who are starting out in data analysis and just want to be pointed in the right direction. Make sure to dig for details somewhere else. If you don’t know Python and are not familiar with the CLI, I would strongly reconsider buying this video. Maybe “R in a Nutshell” would be a slightly better idea.

In case you come from the Computer Science field and you haven’t had anything to do with statistics so far, take a look at “Head First Statistics” and “Head First Data Analysis”. These books should shed some light on the topic and present it in a gentle way.

And, for the curious, just a few links that might be helpful when watching the video:

http://github.com/hmason/ml_class – material for the class
http://github.com/bitly/data_hacks – tools for data analysis in Python
http://scikit-learn.org/stable/ – Python package used during the class
http://www.graphviz.org – graph visualization tool used during the class
http://github.com/sangelone/python-hashes – Python package used during the class

Product page: http://shop.oreilly.com/product/0636920025610.do