Structure Learning in Bayesian Networks Eran Segal Weizmann Institute

Yüklə 477 b.
ölçüsü477 b.

Structure Learning in Bayesian Networks

  • Eran Segal

  • Weizmann Institute

Structure Learning Motivation

  • Network structure is often unknown

  • Purposes of structure learning

    • Discover the dependency structure of the domain
      • Goes beyond statistical correlations between individual variables and detects direct vs. indirect correlations
      • Set expectations: at best, we can recover the structure up to the I-equivalence class
    • Density estimation
      • Estimate a statistical model of the underlying distribution and use it to reason with and predict new instances

Advantages of Accurate Structure

Structure Learning Approaches

  • Constraint based methods

    • View the Bayesian network as representing dependencies
    • Find a network that best explains dependencies
    • Limitation: sensitive to errors in single dependencies
  • Score based approaches

    • View learning as a model selection problem
      • Define a scoring function specifying how well the model fits the data
      • Search for a high-scoring network structure
    • Limitation: super-exponential search space
  • Bayesian model averaging methods

    • Average predictions across all possible structures
    • Can be done exactly (some cases) or approximately

Constraint Based Approaches

  • Goal: Find the best minimal I-Map for the domain

    • G is an I-Map for P if I(G)I(P)
      • Minimal I-Map if deleting an edge from G renders it not an I-Map
    • G is a P-Map for P if I(G)=I(P)
  • Strategy

    • Query the distribution for independence relationships that hold between sets of variables
    • Construct a network which is the best minimal I-Map for P

Constructing Minimal I-Maps

  • Reverse factorization theorem

    •  G is an I-Map of P
  • Algorithm for constructing a minimal I-Map

    • Fix an ordering of nodes X1,…,Xn
    • Select parents of Xi as minimal subset of X1,…,Xi-1, such that Ind(Xi ; X1,…Xi-1 – Pa(Xi) | Pa(Xi))
  • (Outline of) Proof of minimal I-map

    • I-map since the factorization above holds by construction
    • Minimal since by construction, removing one edge destroys the factorization

Constructing P-Maps

  • Simplifying assumptions

    • Network has bound in-degree d per node
    • Oracle can answer Ind. queries of up to 2d+2 variables
    • Distribution P has a P-Map
  • Algorithm

    • Step I: Find skeleton
    • Step II: Find immoral set of v-structures
    • Step III: Direct constrained edges

Step I: Identifying the Skeleton

  • For each pair X,Y query all Z for Ind(X;Y | Z)

    • X–Y is in skeleton if no Z is found
    • If graph in-degree bounded by d  running time O(n2d+2)
      • Since if no direct edge exists, Ind(X;Y | Pa(X), Pa(Y))
  • Reminder

    • If there is no Z for which Ind(X;Y | Z) holds, then XY or YX in G*
      • Proof: Assume no Z exists, and G* does not have XY or YX
      • Then, can find a set Z such that the path from X to Y is blocked
      • Then, G* implies Ind(X;Y | Z) and since G* is a P-Map
      • Contradiction

Step II: Identifying Immoralities

  • For each pair X,Y query candidate triplets X,Y,Z

    • XZY if no W is found that contains Z and Ind(X;Y | W)
    • If graph in-degree bounded by d  running time O(n2d+3)
      • If W exists, Ind(X;Y|W), and XZY not immoral, then ZW
  • Reminder

    • If there is no W such that Z is in W and Ind(X;Y | W), then XZY is an immorality
      • Proof: Assume no W exists but X–Z–Y is not an immorality
      • Then, either XZY or XZY or XZY exists
      • But then, we can block X–Z–Y by Z
      • Then, since X and Y are not connected, can find W that includes Z such that Ind(X;Y | W)
      • Contradiction

Answering Independence Queries

  • Basic query

    • Determine whether two variables are independent
    • Well studied question in statistics
  • Common to frame query as hypothesis testing

    • Null hypothesis is H0
    • H0: Data was sampled from P*(X,Y)=P*(X)P*(Y)
    • Need a procedure that will Accept or Reject the hypothesis
      • 2 test to assess deviance of data from hypothesis
      • Alternatively, use mutual information between X and Y

Structure Based Approaches

  • Strategy

    • Define a scoring function for each candidate structure
    • Search for a high scoring structure
  • Key: choice of scoring function

    • Likelihood based scores
    • Bayesian based scores

Likelihood Scores

  • Goal: find (G,) that maximize the likelihood

    • ScoreL(G:D)=log P(D | G, ’G) where ’G is MLE for G
    • Find G that maximizes ScoreL(G:D)


General Decomposition

  • The Likelihood score decomposes as:

  • Proof:

General Decomposition

  • The Likelihood score decomposes as:

    • Second term does not depend on network structure and thus is irrelevant for selecting between two structures
    • Score increases as mutual information, or strength of dependence between connected variable increases
  • After some manipulation can show:

    • To what extent are the implied Markov assumptions valid

Limitations of Likelihood Score

Avoiding Overfitting

  • Classical problem in machine learning

  • Solutions

    • Restricting the hypotheses space
      • Limits the overfitting capability of the learner
      • Example: restrict # of parents or # of parameters
    • Minimum description length
      • Description length measures complexity
      • Prefer models that compactly describes the training data
    • Bayesian methods
      • Average over all possible parameter values
      • Use prior knowledge

Bayesian Score

Marginal Likelihood of Data Given G

Marginal Likelihood: Binomial Case

  • Assume a sequence of m coin tosses

  • By the chain rule for probabilities

  • Recall that for Dirichlet priors

    • Where MmH is number of heads in first m examples

Marginal Likelihood: Binomial Case

Marginal Likelihood Example

  • Actual experiment with P(H) = 0.25

Marginal Likelihood: BayesNets

  • Network structure determines form of marginal likelihood

Marginal Likelihood: BayesNets

  • Network structure determines form of marginal likelihood

Idealized Experiment

  • P(X = H) = 0.5

  • P(Y = H|X = H) = 0.5 + p P(Y = H|X = T) = 0.5 - p

Marginal Likelihood: BayesNets

  • The marginal likelihood has the form:

  • where

  • M(..) are the counts from the data

  • (..) are hyperparameters for each family


  • Structure prior P(G)

    • Uniform prior: P(G)  constant
    • Prior penalizing number of edges: P(G)  c|G| (0
    • Normalizing constant across networks is similar and can thus be ignored


  • Parameter prior P(|G)

    • BDe prior
      • M0: equivalent sample size
      • B0: network representing the prior probability of events
      • Set (xi,paiG) = M0 P(xi,paiG| B0)
        • Note: paiG are not the same as parents of Xi in B0
        • Compute P(xi,paiG| B0) using standard inference in B0
      • BDe has the desirable property that I-equivalent networks have the same Bayesian score when using the BDe prior for some M’ and P’

Bayesian Score: Asymptotic Behavior

  • For M, a network G with Dirichlet priors satisfies

  • Approximation is called BIC score

    • Score exhibits tradeoff between fit to data and complexity
    • Mutual information grows linearly with M while complexity grows logarithmically with M
      • As M grows, more emphasis is given to the fit to the data

Bayesian Score: Asymptotic Behavior

  • For M, a network G with Dirichlet priors satisfies

  • Bayesian score is consistent

    • As M, the true structure G* maximizes the score
      • Spurious edges will not contribute to likelihood and will be penalized
      • Required edges will be added due to linear growth of likelihood term relative to M compared to logarithmic growth of model complexity

Summary: Network Scores

  • Likelihood, MDL, (log) BDe have the form

  • BDe requires assessing prior network

    • Can naturally incorporate prior knowledge
  • BDe is consistent and asymptotically equivalent (up to a constant) to BIC/MDL

  • All are score-equivalent

    • G I-equivalent to G’Score(G) = Score(G’)

Optimization Problem

  • Input:

    • Training data
    • Scoring function (including priors, if needed)
    • Set of possible structures
      • Including prior knowledge about structure
  • Output:

    • A network (or networks) that maximize the score
  • Key Property:

    • Decomposability: the score of a network is a sum of terms.

Learning Trees

  • Trees

    • At most one parent per variable
  • Why trees?

    • Elegant math
      • we can solve the optimization problem efficiently (with a greedy algorithm)
    • Sparse parameterization
      • avoid overfitting while adapting to the data

Learning Trees

  • Let p(i) denote parent of Xi, or 0 if Xi has no parent

  • We can write the score as

  • Score = sum of edge scores + constant

Learning Trees

  • Algorithm

    • Construct graph with vertices: 1,...n
    • Set w(ij) = Score(Xj | Xi ) - Score(Xj)
    • Find tree (or forest) with maximal weight
      • This can be done using standard algorithms in low-order polynomial time by building a tree in a greedy fashion (Kruskal’s maximum spanning tree algorithm)
  • Theorem: Procedure finds the tree with maximal score

  • When score is likelihood, then w(ij) is proportional to I(Xi; Xj). This is known as the Chow & Liu method

Learning Trees: Example

Beyond Trees

  • Problem is not easy for more complex networks

    • Example: Allowing two parents, greedy algorithm is no longer guaranteed to find the optimal network
  • In fact, no efficient algorithm exists

  • Theorem:

    • Finding maximal scoring network structure with at most k parents for each variable is NP-hard for k>1

Fixed Ordering

  • For any decomposable scoring function Score(G:D) and ordering  the maximal scoring network has:

  •  For fixed ordering we have independent problems

  • If we bound the in-degree per variable by d, then complexity is exponential in d

Heuristic Search

  • We address the problem by using heuristic search

  • Define a search space:

    • nodes are possible structures
    • edges denote adjacency of structures
  • Traverse this space looking for high-scoring structures

  • Search techniques:

    • Greedy hill-climbing
    • Best first search
    • Simulated Annealing
    • ...

Heuristic Search

  • Typical operations:

Exploiting Decomposability

  • Caching: To update the score after a local change, we only need to rescore the families that were changed in the last move

Greedy Hill Climbing

  • Simplest heuristic local search

    • Start with a given network
      • empty network
      • best tree
      • a random network
    • At each iteration
    • Stop when no modification improves score
  • Each step requires evaluating O(n) new changes

Greedy Hill Climbing Pitfalls

  • Greedy Hill-Climbing can get stuck in:

    • Local Maxima
      • All one-edge changes reduce the score
    • Plateaus
      • Some one-edge changes leave the score unchanged
      • Happens because equivalent networks received the same score and are neighbors in the search space
  • Both occur during structure search

  • Standard heuristics can escape both

    • Random restarts
    • TABU search

Equivalence Class Search

  • Idea

    • Search the space of equivalence classes
    • Equivalence classes can be represented by PDAGs (partially ordered graph)
  • Advantages

    • PDAGs space has fewer local maxima and plateaus
    • There are fewer PDAGs than DAGs

Equivalence Class Search

  • Evaluating changes is more expensive

    • In addition to search, need to score a consistent network
  • These algorithms are more complex to implement

Learning Example: Alarm Network

Model Selection

  • So far, we focused on single model

    • Find best scoring model
    • Use it to predict next example
  • Implicit assumption:

  • Pros:

    • We get a single structure
    • Allows for efficient use in our tasks
  • Cons:

    • We are committing to the independencies of a particular structure
    • Other structures might be as probable given the data

Model Selection

  • Density estimation

    • One structure may suffice, if its joint distribution is similar to other high scoring structures
  • Structure discovery

    • Define features f(G) (e.g., edge, sub-structure, d-sep query)
    • Compute
      • Still requires summing over exponentially many structures

Model Averaging Given an Order

  • Assumptions

    • Known total order of variables 
    • Maximum in-degree for variables d
  • Marginal likelihood

Model Averaging Given an Order

  • Posterior probability of a general feature

  • Posterior probability of particular choice of parents

  • Posterior probability of particular edge choice

Model Averaging

  • We cannot assume that order is known

  • Solution: Sample from posterior distribution of P(G|D)

    • Estimate feature probability by
    • Sampling can be done by MCMC
    • Sampling can be done over orderings

Notes on Learning Local Structures

  • Define score with local structures

    • Example: in tree CPDs, score decomposes by leaves
  • Prior may need to be extended

    • Example: in tree CPDs, penalty for tree structure per CPD
  • Extend search operators to local structure

    • Example: in tree CPDs, we need to search for tree structure
    • Can be done by local encapsulated search or by defining new global operations

Search: Summary

  • Discrete optimization problem

  • In general, NP-Hard

    • Need to resort to heuristic search
    • In practice, search is relatively fast (~100 vars in ~10 min):
      • Decomposability
      • Sufficient statistics
  • In some cases, we can reduce the search problem to an easy optimization problem

    • Example: learning trees

Yüklə 477 b.

Dostları ilə paylaş:

Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur © 2020
rəhbərliyinə müraciət

    Ana səhifə