River is a Python library for online machine learning. It is the result of a merger between creme and scikit-multiflow. River’s ambition is to be the go-to library for doing machine learning on streaming data.
As a quick example, we’ll train a logistic regression to classify the website phishing dataset. Here’s a look at the first observation in the dataset.
>>> from pprint import pprint
>>> from river import datasets
>>> dataset = datasets.Phishing()
>>> for x, y in dataset:
...     pprint(x)
...     print(y)
...     break
{'age_of_domain': 1,
 'anchor_from_other_domain': 0.0,
 'empty_server_form_handler': 0.0,
 'https': 0.0,
 'ip_in_url': 1,
 'is_popular': 0.5,
 'long_url': 1.0,
 'popup_window': 0.0,
 'request_from_other_domain': 0.0}
True
Now let’s run the model on the dataset in a streaming fashion. We sequentially interleave predictions and model updates. Meanwhile, we update a performance metric to see how well the model is doing.
>>> from river import compose
>>> from river import linear_model
>>> from river import metrics
>>> from river import preprocessing
>>> model = compose.Pipeline(
...     preprocessing.StandardScaler(),
...     linear_model.LogisticRegression()
... )
>>> metric = metrics.Accuracy()
>>> for x, y in dataset:
...     y_pred = model.predict_one(x)  # make a prediction
...     metric.update(y, y_pred)       # update the metric in place
...     model.learn_one(x, y)          # make the model learn in place
>>> metric
Accuracy: 89.20%
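To make the predict-then-learn pattern concrete without any dependency, here is a minimal pure-Python sketch of the same loop. It is not River's implementation: `OnlineLogisticRegression`, `synthetic_stream`, and `progressive_accuracy` are hypothetical names invented for this illustration, the data is synthetic rather than the phishing dataset, and the model is a plain SGD logistic regression with an assumed learning rate of 0.1.

```python
import math
import random


class OnlineLogisticRegression:
    """Hypothetical toy model: logistic regression trained by SGD, one observation at a time."""

    def __init__(self, lr=0.1):
        self.lr = lr          # assumed learning rate, chosen for illustration
        self.weights = {}     # feature name -> weight, created lazily
        self.bias = 0.0

    def predict_proba_one(self, x):
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def predict_one(self, x):
        return self.predict_proba_one(x) >= 0.5

    def learn_one(self, x, y):
        # Gradient of the log loss with respect to the linear score.
        error = self.predict_proba_one(x) - float(y)
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v
        self.bias -= self.lr * error


def synthetic_stream(n, seed=42):
    """Hypothetical stand-in for a dataset: yields (features, label) pairs."""
    rng = random.Random(seed)
    for _ in range(n):
        x = {"a": rng.uniform(-1, 1), "b": rng.uniform(-1, 1)}
        yield x, x["a"] + x["b"] > 0


def progressive_accuracy(model, stream):
    """Score each prediction before the model is allowed to learn from its label."""
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)  # predict before seeing the label
        correct += int(y_pred == y)    # score that prediction
        total += 1
        model.learn_one(x, y)          # only then learn from the observation
    return correct / total


model = OnlineLogisticRegression()
accuracy = progressive_accuracy(model, synthetic_stream(2000))
print(f"Accuracy: {accuracy:.2%}")
```

Because every observation is scored before it is used for training, the reported accuracy is an out-of-sample estimate even though no separate test set is held out.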