2016-10-04 21:00:00

Machine Learning: a Probabilistic Perspective



Chapter 1: Introduction

Machine learning: what and why?

Types of machine learning

Supervised learning


The need for probabilistic predictions
Real-world applications


Unsupervised learning

Discovering clusters

Discovering latent factors

Discovering graph structure

Matrix completion

Image inpainting
Collaborative filtering
Market basket analysis

Some basic concepts in machine learning

Parametric vs non-parametric models

A simple non-parametric classifier: K-nearest neighbors

The curse of dimensionality

Parametric models for classification and regression

Linear regression

Logistic regression

logistic / logit function: another name for the sigmoid function.
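As a quick illustration of the note above (a minimal sketch, not code from the book), the sigmoid maps any real number into (0, 1), and the logit is its inverse:

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: squashes any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the sigmoid: maps a probability p in (0, 1) back to the reals."""
    return math.log(p / (1.0 - p))
```

For example, sigmoid(0) is exactly 0.5, and logit(sigmoid(x)) recovers x.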


Model selection

No free lunch theorem

Chapter 2: Probability


A brief review of probability theory

Discrete random variables

Fundamental rules

Probability of a union of two events
Joint probabilities
Conditional probability

Bayes’s rule

Example: medical diagnosis
Example: Generative classifiers

Independence and conditional independence

Continuous random variables


Mean and variance

Some common discrete distributions

The binomial and Bernoulli distributions

The multinomial and multinoulli distributions

Application: DNA sequence motifs

The Poisson distribution

The empirical distribution

Some common continuous distributions

Gaussian (normal) distribution

Degenerate pdf

The Student’s t distribution

The Laplace distribution

The gamma distribution

The beta distribution

Pareto distribution

Joint probability distributions

Covariance and correlation

The multivariate Gaussian

Multivariate Student t distribution

Dirichlet distribution

Transformations of random variables

Linear transformations

General transformations

Multivariate change of variables

Central limit theorem

Monte Carlo approximation

Example: change of variables, the MC way

Example: estimating π by Monte Carlo integration
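The π-estimation example named above can be sketched as follows (a minimal illustration of the standard quarter-circle setup, not code from the book): sample points uniformly in the unit square and count the fraction falling inside the quarter circle, which converges to π/4.

```python
import random

def estimate_pi(n_samples=100_000, seed=0):
    """Monte Carlo estimate of pi: the fraction of uniform points
    in the unit square with x^2 + y^2 <= 1 approximates pi / 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples
```

The error shrinks like 1/sqrt(n_samples), which is the point of the "Accuracy of Monte Carlo approximation" section below.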

Accuracy of Monte Carlo approximation

Information theory


KL divergence

Mutual information

Mutual information for continuous random variables

Chapter 3: Generative models for discrete data


Bayesian concept learning




Posterior predictive distribution

A more complex prior

The beta-binomial model




Posterior mean and mode
Posterior variance

Posterior predictive distribution

Overfitting and the black swan paradox
Predicting the outcome of multiple future trials

The Dirichlet-multinomial model




Posterior predictive

Worked example: language models using bag of words

Naive Bayes classifiers

Model fitting

Bayesian naive Bayes

Using the model for prediction

The log-sum-exp trick

Feature selection using mutual information

Classifying documents using bag of words

Chapter 4: Gaussian models




MLE for an MVN


Maximum entropy derivation of Gaussian

Gaussian discriminant analysis

Quadratic discriminant analysis (QDA)

Linear discriminant analysis (LDA)

Two-class LDA

MLE for discriminant analysis

Strategies for preventing overfitting

Regularized LDA

Diagonal LDA

Nearest shrunken centroids classifier

Inference in jointly Gaussian distributions

Statement of the result


Marginals and conditionals of a 2d Gaussian
Interpolating noise-free data
Data imputation

Information form

Proof of the result

Inverse of a partitioned matrix using Schur complements
The matrix inversion lemma
Proof of Gaussian conditioning formulas

Linear Gaussian systems

Statement of the result


Inferring an unknown scalar from noisy measurements
Inferring an unknown vector from noisy measurements
Interpolating noisy data

Proof of the result

Digression: The Wishart distribution

Inverse Wishart distribution

Visualizing the Wishart distribution

Inferring the parameters of an MVN

Posterior distribution of μ

Posterior distribution of Σ

MAP estimation
Univariate posterior

Posterior distribution of μ and Σ

Posterior mode
Posterior marginals
Posterior predictive
Posterior for scalar data
Bayesian t-test
Connection with frequentist statistics

Sensor fusion with unknown precisions

Chapter 5: Bayesian statistics


Summarizing posterior distributions

MAP estimation

No measure of uncertainty
Plugging in the MAP estimate can result in overfitting
The mode is an untypical point
MAP estimation is not invariant to reparameterization

Credible intervals

Highest posterior density regions

Inference for a difference in proportions

Bayesian model selection

Bayesian Occam’s razor

Computing the marginal likelihood (evidence)

Beta-binomial model
Dirichlet-multinomial model
Gaussian-Wishart-Gaussian model
BIC approximation to log marginal likelihood
Effect of prior

Bayes factors

Example: Testing if a coin is fair

Jeffreys-Lindley paradox


Uninformative priors

Jeffreys priors

Example: Jeffreys prior for the Bernoulli and multinoulli
Example: Jeffreys prior for location and scale parameters

Robust priors

Mixture of conjugate priors

Application: Finding conserved regions in DNA and protein sequences

Hierarchical Bayes

Empirical Bayes

Example: beta-binomial model

Example: Gaussian-Gaussian model

Example: predicting baseball scores
Estimating the hyper-parameters

Bayesian decision theory

Bayes estimators for common loss functions

MAP estimate minimizes 0-1 loss
Reject option
Posterior mean minimizes l2 (quadratic) loss
Posterior median minimizes l1 (absolute) loss
Supervised learning

The false positive vs false negative tradeoff

ROC curves and all that
Precision recall curves
False discovery rates

Other topics

Contextual bandits
Utility theory
Sequential decision theory

Chapter 6: Frequentist statistics


Sampling distribution of an estimator


Large sample theory for the MLE

Frequentist decision theory

Bayes risk

Minimax risk

Admissible estimators

Stein’s paradox
Admissibility is not enough

Desirable properties of estimators

Consistent estimators

Unbiased estimators

Minimum variance estimators

The bias-variance tradeoff

Example: estimating a Gaussian mean
Example: ridge regression
Bias-variance tradeoff for classification

Empirical risk minimization

Regularized risk minimization

Structural risk minimization

Estimating the risk using cross validation

Example: using CV to pick λ for ridge regression
The one standard error rule
CV for model selection in non-probabilistic unsupervised learning

Upper bounding the risk using statistical learning theory

Surrogate loss functions

Pathologies of frequentist statistics

Counter-intuitive behavior of confidence intervals

p-values considered harmful

The likelihood principle

Why isn’t everyone a Bayesian?

Chapter 7: Linear regression


Model specification

Maximum likelihood estimation (least squares)

Derivation of the MLE

Geometric interpretation


Robust linear regression

Ridge regression

Basic idea

Numerically stable computation

Connection with PCA

Regularization effects of big data

Bayesian linear regression

Computing the posterior

Computing the posterior predictive

Bayesian inference when σ^2 is unknown

Conjugate prior
Uninformative prior
An example where Bayesian and frequentist inference coincide

EB for linear regression (evidence procedure)

Chapter 8: Logistic regression


Model specification

Model fitting


Steepest descent

Newton’s method

Iteratively reweighted least squares (IRLS)

Quasi-Newton (variable metric) methods

l2 regularization

Multi-class logistic regression

Bayesian logistic regression

Laplace approximation

Derivation of Bayesian information criterion (BIC)

Gaussian approximation for logistic regression

Approximating the posterior predictive

Monte Carlo approximation
Probit approximation (moderated output)

Residual analysis (outlier detection)

Online learning and stochastic optimization

Online learning and regret minimization

Stochastic optimization and risk minimization

Setting the step size
Per-parameter step size
SGD compared to batch learning

The LMS algorithm

The perceptron algorithm

A Bayesian view

Generative vs discriminative classifiers

Pros and cons of each approach

Dealing with missing data

Missing data at test time
Missing data at training time

Fisher’s linear discriminant analysis (FLDA)

Derivation of the optimal 1d projection
Extension to higher dimensions and multiple classes
Probabilistic interpretation of FLDA

Chapter 9: Generalized linear models and the exponential family


The exponential family




Log partition function

Example: the Bernoulli distribution

MLE for the exponential family

Bayes for the exponential family

Posterior predictive density
Example: Bernoulli distribution

Maximum entropy derivation of the exponential family

Generalized linear models (GLMs)


ML and MAP estimation

Bayesian inference

Probit regression

ML/MAP estimation using gradient-based optimization

Latent variable interpretation

Ordinal probit regression

Multinomial probit models

Multi-task learning

Hierarchical Bayes for multi-task learning

Application to personalized email spam filtering

Application to domain adaptation

Other kinds of prior

Generalized linear mixed models

Example: semi-parametric GLMMs for medical data

Computational issues

Learning to rank

The pointwise approach

The pairwise approach

The listwise approach

Loss functions for ranking

Chapter 10: Directed graphical models (Bayes nets)


Chain rule

Conditional independence

Graphical models

Directed graphical models


Naive Bayes classifiers

Markov and hidden Markov models

Medical diagnosis

Genetic linkage analysis

Directed Gaussian graphical models



Plate notation

Learning from complete data

Learning with missing and/or latent variables

Conditional independence properties of DGMs

d-separation and the Bayes Ball algorithm (global Markov properties)

Other Markov properties of DGMs

Influence (decision) diagrams

Chapter 11: Mixture models and the EM algorithm

Latent variable models

Mixture models

Mixture of Gaussians

Mixture of multinoullis

Using mixture models for clustering

Mixtures of experts

Application to inverse problems

Parameter estimation for mixture models


Computing a MAP estimate is non-convex

The EM algorithm

EM for GMMs

Auxiliary function
E step
M step
K-means algorithm
Vector quantization
Initialization and avoiding local minima
MAP estimation
EM for mixture of experts

EM for DGMs with hidden variables

EM for the Student distribution

EM with ν known
EM with ν unknown
Mixtures of Student distributions

EM for probit regression

Theoretical basis for EM

EM monotonically increases the observed data log likelihood

Online EM

Batch EM review
Incremental EM
Stepwise EM

Other EM variants

Model selection for latent variable models

Model selection for probabilistic models

Model selection for non-probabilistic methods

Fitting models with missing data

EM for the MLE of an MVN with missing data

Getting started
E step
M step
Extension to the GMM case

Chapter 12: Latent linear models

Factor analysis

FA is a low rank parameterization of an MVN

Inference of the latent factors


Mixtures of factor analysers

EM for factor analysis models

Fitting FA models with missing data

Principal components analysis (PCA)

Classical PCA: statement of the theorem


Singular value decomposition (SVD)

Probabilistic PCA

EM algorithm for PCA

Choosing the number of latent dimensions

Model selection for FA/PPCA

Model selection for PCA

Profile likelihood

PCA for categorical data

PCA for paired and multi-view data

Supervised PCA (latent factor regression)

Discriminative supervised PCA

Partial least squares

Canonical correlation analysis

Independent Component Analysis (ICA)

Maximum likelihood estimation

The FastICA algorithm

Modeling the source densities

Using EM

Other estimation principles

Maximizing non-Gaussianity
Minimizing mutual information
Maximizing mutual information (infomax)

Chapter 13: Sparse linear models


Bayesian variable selection

The spike and slab model

From the Bernoulli-Gaussian model to l0 regularization



Copyright (C) 2014-2016 Hiroharu Kato. All Rights Reserved.