Machine Learning and Big Data
Traditional predictive model building has been around for many decades.
It's a mainstream technology and standard business practice in many industries, including insurance, banking, and health care. In the last several years machine learning has taken off in nearly all sectors. It's the smarter internet, an enabler of intelligence.
At its core, predictive model building is about *predictions* - predicting the rise and fall of markets, credit card swipes and fraud, equipment failures and planned replacements, diseases and treatments, and many other things. These predictions drive business decisions, expose patterns, save lives, and generally enable people to act smarter and live better.
Traditionally, model building has been done with small data - data that fits on a practitioner's single laptop, or perhaps a single server. But more data can build better, more predictive models - and the falling price of storage has given every business the chance to hold onto every bit of data - Big Data.
Machine Learning can use that Big Data to do better model building - not just bigger or more models (that happens too) but truly more predictive models. Using that Big Data now requires a cluster of servers instead of a single machine - and it requires parallel and distributed computation. Thus Machine Learning is entering the realm of high-performance computing (HPC) - crunching terabytes and petabytes instead of gigabytes - to build the next wave of prediction engines.
The technical requirements for this Big Math are as wide and varied as the history of the math itself. Deep Learning (Neural Nets) needs as many floating point operations (FLOPs) per byte of data as it can get, and thus is ideally suited to GPU computing. Generalized Linear Modeling (which includes Logistic Regression) scales very strongly, and can readily use both more CPU cores and more memory bandwidth - a billion-row logistic regression can complete in a few seconds on a modest EC2 cluster. Gradient Boosting Machines can become network bound once the data no longer fits on a single server. There is no one-size-fits-all solution, either for the math itself or for which math is best for which problem.
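One reason logistic regression spreads well across cores and servers is that the gradient of its loss is a plain sum over rows, so each machine can work on its own chunk and only small per-chunk results need to be combined. The sketch below illustrates that property only; the data, the weights, and the function names are hypothetical, and it is not a description of how any particular product implements it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def partial_gradient(rows, weights):
    """Gradient of the log-loss over one chunk of (features, label) rows."""
    grad = [0.0] * len(weights)
    for features, label in rows:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad[j] += (pred - label) * x
    return grad

def combine(gradients):
    """Element-wise sum of per-chunk gradients (the 'reduce' step)."""
    return [sum(parts) for parts in zip(*gradients)]

# Hypothetical toy data: features start with 1.0 for the intercept term.
data = [([1.0, 0.5], 1), ([1.0, -1.2], 0), ([1.0, 2.0], 1), ([1.0, -0.3], 0)]
weights = [0.1, -0.2]

# Gradient computed in one pass over all rows.
whole = partial_gradient(data, weights)

# Same gradient computed chunk by chunk; each chunk stands in for one server.
chunks = [data[:2], data[2:]]
split = combine(partial_gradient(chunk, weights) for chunk in chunks)
```

Up to floating point rounding, `whole` and `split` are the same vector: only one short list of numbers per chunk has to cross the network, however many rows each chunk holds.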
The Math requires the data to be in flat 2-D tables: features going across (things like age, salary, or prior purchase choices), and observations going down. Again, the problem domain can dramatically change the "shape" of the data and the math that applies. In many problems the features are named and intuitive, and the count of observations can run from the millions to billions (40 years of credit card history) to trillions (network packet inspection to detect intrusion attacks). In the medical domain, most studies have very few observations (hundreds to thousands of people) but perhaps tens of thousands of features (the individual people are complex). In text analytics each "feature" might be the presence or absence of a single word - leading to millions of features per document for millions of documents, but the data is both very sparse (the feature says "No!" 1000x more often than "Yes") and very compressible - requiring math that works on highly compressed data formats.
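The text-analytics case can be made concrete with a small sketch. It builds word-presence features for three made-up documents and compares a dense 0/1 table with a sparse form that stores only the column indices where the answer is "Yes". The documents and helper names are hypothetical, and real systems use far more elaborate compressed formats.

```python
def build_vocabulary(documents):
    """Assign each distinct word a column index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse_row(doc, vocab):
    """Sorted column indices of the words present in one document."""
    return sorted({vocab[w] for w in doc.lower().split() if w in vocab})

def to_dense_row(sparse_row, n_features):
    """Expand a sparse row into a full 0/1 row with one cell per feature."""
    row = [0] * n_features
    for idx in sparse_row:
        row[idx] = 1
    return row

# Hypothetical documents: observations go down, word features go across.
docs = ["the cat sat", "the dog barked loudly", "a cat and a dog"]
vocab = build_vocabulary(docs)

sparse = [to_sparse_row(d, vocab) for d in docs]
dense = [to_dense_row(r, len(vocab)) for r in sparse]

stored_sparse = sum(len(r) for r in sparse)  # cells kept in the sparse form
stored_dense = sum(len(r) for r in dense)    # cells kept in the dense table
```

With 8 distinct words across 3 documents, the dense table holds 24 cells while the sparse form holds 11 indices. The gap widens rapidly as the vocabulary grows into the millions and each document still contains only a handful of words.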
The real world is of course not 2-D, so the Big Data has to be carefully prepared before the Big Math can work on it - and this is a non-standard ETL problem, often called "data munging". Basically, the practitioners need to be able to play with the data: to throw out outliers (or not, if the "outlier" is the rare cancer event you're trying to predict); to impute missing data; to group data on different criteria; to merge data from many sources (classic JOINs); to normalize unrelated datasets; perhaps to take the log or other complex functions of highly skewed data... to explore the big data. The practitioners need an *interactive* tool that works on tera-scale data, not a batch process, and not one limited to simple slicing or moving averages.
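A few of the munging steps listed above (imputing a missing value, taking the log of a skewed column, joining a second source, and grouping) can be sketched on a toy table using only the Python standard library. The records, column names, and helper functions are hypothetical, and a real tool would run the same operations interactively across a cluster rather than over a four-row list.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical records; None marks a missing salary.
customers = [
    {"id": 1, "region": "east", "salary": 52000.0},
    {"id": 2, "region": "west", "salary": None},
    {"id": 3, "region": "east", "salary": 48000.0},
    {"id": 4, "region": "west", "salary": 950000.0},
]
purchases = {1: 3, 2: 1, 4: 7}  # second source: customer id -> prior purchases

def impute_median(rows, column):
    """Fill missing values with the median of the observed ones."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [dict(r, **{column: fill if r[column] is None else r[column]})
            for r in rows]

def add_log(rows, column):
    """Add a log-transformed copy of a highly skewed, positive column."""
    return [dict(r, **{"log_" + column: math.log(r[column])}) for r in rows]

def left_join(rows, lookup, key, new_column, default=0):
    """Attach a column from another source, keeping every left-hand row."""
    return [dict(r, **{new_column: lookup.get(r[key], default)}) for r in rows]

def group_mean(rows, by, column):
    """Mean of one column within each group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[by]].append(r[column])
    return {k: sum(v) / len(v) for k, v in groups.items()}

table = impute_median(customers, "salary")
table = add_log(table, "salary")
table = left_join(table, purchases, "id", "purchases")
mean_by_region = group_mean(table, "region", "salary")
```

Note that the 950000.0 salary is deliberately kept: whether such a row is noise or the very event being predicted is a judgment the practitioner has to make, which is why these steps need to be interactive.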
And of course the output of the predictive model also does not stand alone; it must live in a flow of real world events where the predictions can be used to shape the future! That means the model has to move out of the realm of mathematics and into existing workflows. Historically, this has been done when the data scientist hands the carefully crafted model over to a programming team tasked with, e.g., allowing or denying a credit card swipe. The model is transformed into code - and that act of translation is fraught with opportunities for error; the programmers may not be aware of all the nuances of the math, and the data scientist may have made unrealistic assumptions about the world.
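One way to shrink the room for translation errors is to hand over the model as plain data plus a small generic scoring function, along with reference cases the production code must reproduce. The sketch below shows that idea for a logistic model deciding on a card swipe; the coefficients, feature names, threshold, and reference cases are all invented for illustration and do not come from any real model.

```python
import math

# Hypothetical fitted model, exported as data instead of hand-translated code.
MODEL = {
    "intercept": -3.0,
    "coefficients": {"amount_usd": 0.004, "foreign": 1.5},
    "threshold": 0.5,
}

def fraud_probability(model, event):
    """Logistic score: sigmoid of intercept plus weighted feature values."""
    z = model["intercept"] + sum(
        coef * event[name] for name, coef in model["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))

def allow_swipe(model, event):
    """Allow the swipe when the predicted fraud probability is below threshold."""
    return fraud_probability(model, event) < model["threshold"]

# Hypothetical reference cases supplied by the data scientist:
# (event, expected allow/deny decision).
REFERENCE_CASES = [
    ({"amount_usd": 20.0, "foreign": 0}, True),
    ({"amount_usd": 900.0, "foreign": 1}, False),
]

# Any event listed here means production scoring disagrees with the model.
mismatches = [
    event for event, expected in REFERENCE_CASES
    if allow_swipe(MODEL, event) != expected
]
```

An empty `mismatches` list does not prove the deployment is correct, but a non-empty one catches a mistranslation before it reaches live traffic.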
While the path to getting a predictive model into production is long, the result can be well worth it: dramatic drops in fraud, or in unplanned downtime costs, or even lives saved directly.
We at H2O.ai have been building the product H2O - a Machine Learning tool that runs on clusters to build models 1000x faster - or with 1000x more data - than the older generation of tools. We use state-of-the-art ML algorithms: stuff that was literally invented just last year (Generalized Low Rank Modeling), as well as the tried-and-true algorithms (Generalized Linear Modeling) and highly effective algorithms like Neural Nets (Deep Learning) and Gradient Boosting Machines.
We are focused on making Machine Learning enterprise-ready: easier to deploy and use; fast and robust; available both on on-demand in-cloud clusters and on secure on-premises clusters; and with models that can drop directly into high-volume, high-speed scoring engines.
We believe Machine Learning is the Next Big Thing - a direct enabler of human intelligence encoded in a mathematical model that is used to predict the future. We are dedicated to bringing the benefits of Machine Learning to everybody.