# Structure Learning in Bayesian Networks

Eran Segal, Weizmann Institute

## Purposes of structure learning

• Discover the dependency structure of the domain
• Goes beyond statistical correlations between individual variables and distinguishes direct from indirect dependencies
• Set expectations: at best, we can recover the structure up to the I-equivalence class
• Density estimation
• Estimate a statistical model of the underlying distribution and use it to reason with and predict new instances

## Constraint-based methods

• View the Bayesian network as representing dependencies
• Find a network that best explains dependencies
• Limitation: sensitive to errors in single dependencies
## Score-based approaches

• View learning as a model selection problem
• Define a scoring function specifying how well the model fits the data
• Search for a high-scoring network structure
• Limitation: super-exponential search space
## Bayesian model averaging methods

• Average predictions across all possible structures
• Can be done exactly (some cases) or approximately • ## Goal: Find the best minimal I-Map for the domain

• G is an I-Map for P if I(G) ⊆ I(P)
• Minimal I-Map if deleting an edge from G renders it not an I-Map
• G is a P-Map for P if I(G)=I(P)
## Strategy

• Query the distribution for independence relationships that hold between sets of variables
• Construct a network that is the best minimal I-Map for P

## Reverse factorization theorem

• If P factorizes according to G, i.e., P(X1,…,Xn) = ∏i P(Xi | PaiG), then G is an I-Map of P
## Algorithm for constructing a minimal I-Map

• Fix an ordering of nodes X1,…,Xn
• Select parents of Xi as a minimal subset of {X1,…,Xi-1} such that Ind(Xi ; {X1,…,Xi-1} − Pa(Xi) | Pa(Xi))
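The construction above can be sketched in code. This is a minimal sketch, assuming a hypothetical independence oracle `ind(x, rest, parents)` that answers whether Ind(x ; rest | parents) holds in the underlying distribution; in practice the oracle would be replaced by statistical tests:

```python
from itertools import combinations

def build_minimal_imap(variables, ind):
    """Construct a minimal I-map for a fixed variable ordering.

    `ind(x, rest, parents)` is a hypothetical oracle answering whether
    Ind(x ; rest | parents) holds; it must return True when `rest` is empty.
    """
    parents = {}
    for i, x in enumerate(variables):
        preds = variables[:i]
        chosen = preds  # fall back to all predecessors
        # Try candidate parent sets in increasing size, so the first
        # subset that screens off the remaining predecessors is minimal
        for size in range(len(preds) + 1):
            found = None
            for cand in combinations(preds, size):
                rest = [v for v in preds if v not in cand]
                if ind(x, rest, list(cand)):
                    found = list(cand)
                    break
            if found is not None:
                chosen = found
                break
        parents[x] = chosen
    return parents

def oracle(x, rest, given):
    # Hypothetical oracle for the chain X1 -> X2 -> X3:
    # the only nontrivial independence is Ind(X3 ; X1 | X2)
    if not rest:
        return True
    return x == "X3" and rest == ["X1"] and given == ["X2"]

print(build_minimal_imap(["X1", "X2", "X3"], oracle))
# {'X1': [], 'X2': ['X1'], 'X3': ['X2']}
```

With the ordering X1, X2, X3 the recovered parent sets are exactly those of the generating chain.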
## (Outline of) Proof of minimal I-Map

• I-map since the factorization above holds by construction
• Minimal since by construction, removing one edge destroys the factorization

## Simplifying assumptions

• Network has bounded in-degree d per node
• An oracle can answer independence queries over up to 2d+2 variables
• Distribution P has a P-Map
## Algorithm

• Step I: Find skeleton
• Step II: Find immoralities (v-structures)
• Step III: Direct constrained edges

## For each pair X,Y, query all Z for Ind(X;Y | Z)

• X–Y is in skeleton if no Z is found
• If graph in-degree is bounded by d, running time is O(n^(2d+2))
• Since if no direct edge exists, Ind(X;Y | Pa(X), Pa(Y))
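Step I can be sketched as follows, again assuming a hypothetical independence oracle `ind(x, y, z)` for the underlying distribution and the in-degree bound d:

```python
from itertools import combinations

def find_skeleton(variables, ind, d):
    """Step I sketch: X-Y is in the skeleton iff no witness set Z
    (|Z| <= 2d) with Ind(X ; Y | Z) is found.

    `ind(x, y, z)` is a hypothetical independence oracle."""
    edges = set()
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        # Search all conditioning sets up to the size implied by the bound d
        separated = any(
            ind(x, y, set(z))
            for size in range(min(2 * d, len(others)) + 1)
            for z in combinations(others, size)
        )
        if not separated:
            edges.add(frozenset((x, y)))
    return edges

def oracle(x, y, z):
    # Hypothetical oracle for the chain X1 -> X2 -> X3:
    # the only independence is Ind(X1 ; X3 | X2)
    return {x, y} == {"X1", "X3"} and "X2" in z

print(sorted(sorted(e) for e in find_skeleton(["X1", "X2", "X3"], oracle, d=1)))
# [['X1', 'X2'], ['X2', 'X3']]
```

The pair X1,X3 is dropped because the witness set {X2} separates them; the two chain edges survive.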
## Reminder

• If there is no Z for which Ind(X;Y | Z) holds, then X→Y or Y→X in G*
• Proof: Assume no such Z exists, and G* has neither X→Y nor Y→X
• Then, we can find a set Z such that every path from X to Y is blocked
• Then, G* implies Ind(X;Y | Z), and since G* is a P-Map, Ind(X;Y | Z) holds in P, a contradiction

## For each pair X,Y, query candidate triplets X,Z,Y

• XZY if no W is found that contains Z and Ind(X;Y | W)
• If graph in-degree bounded by d  running time O(n2d+3)
• If W exists, Ind(X;Y|W), and XZY not immoral, then ZW
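Step II can be sketched in the same style, given a skeleton from Step I and the same hypothetical oracle:

```python
from itertools import combinations

def find_immoralities(variables, skeleton, ind, d):
    """Step II sketch: orient X -> Z <- Y when X-Z and Z-Y are in the
    skeleton, X and Y are non-adjacent, and no witness set W containing Z
    satisfies Ind(X ; Y | W). `ind(x, y, w)` is a hypothetical oracle."""
    def adj(a, b):
        return frozenset((a, b)) in skeleton

    vstructs = set()
    for x, y in combinations(variables, 2):
        if adj(x, y):
            continue
        for z in variables:
            if z in (x, y) or not (adj(x, z) and adj(z, y)):
                continue
            others = [v for v in variables if v not in (x, y)]
            # Witness sets must contain Z, so start at size 1
            witnessed = any(
                z in w and ind(x, y, set(w))
                for size in range(1, min(2 * d, len(others)) + 1)
                for w in combinations(others, size)
            )
            if not witnessed:
                vstructs.add((x, z, y))
    return vstructs

variables = ["X1", "X2", "X3"]
skeleton = {frozenset(("X1", "X3")), frozenset(("X2", "X3"))}

def oracle(x, y, w):
    # Hypothetical oracle for the collider X1 -> X3 <- X2:
    # X1 and X2 are independent unless X3 is conditioned on
    return {x, y} == {"X1", "X2"} and "X3" not in w

print(find_immoralities(variables, skeleton, oracle, d=1))
# {('X1', 'X3', 'X2')}
```

Every witness set containing X3 fails the independence test, so the collider X1→X3←X2 is recovered.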
## Reminder

• If there is no W such that Z ∈ W and Ind(X;Y | W), then X→Z←Y is an immorality
• Proof: Assume no such W exists, but X–Z–Y is not an immorality
• Then, either X→Z→Y or X←Z←Y or X←Z→Y exists
• But then, we can block the path X–Z–Y by conditioning on Z
• Then, since X and Y are not adjacent, we can find W that includes Z such that Ind(X;Y | W)

## Basic query

• Determine whether two variables are independent
• Well studied question in statistics
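Such a query can be sketched as a Pearson χ² test on a contingency table of observed counts. This is a plain-Python sketch (`chi2_independence` is an illustrative helper, not a library function); 3.84 is the standard df = 1, α = 0.05 critical value:

```python
def chi2_independence(table):
    """Pearson chi-squared statistic for an r x c contingency table;
    a sketch of testing H0: P*(X,Y) = P*(X) P*(Y)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under H0
            stat += (obs - exp) ** 2 / exp
    return stat

# Perfectly proportional table: statistic is 0, accept H0
print(chi2_independence([[10, 20], [30, 60]]))  # 0.0
# Strongly dependent table: statistic far exceeds the df=1,
# alpha=0.05 critical value of about 3.84, so reject H0
print(chi2_independence([[50, 10], [10, 50]]) > 3.84)  # True
```

In production one would use a vetted implementation such as `scipy.stats.chi2_contingency` rather than hand-rolled counts.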
## Common to frame the query as hypothesis testing

• Null hypothesis is H0
• H0: Data was sampled from P*(X,Y)=P*(X)P*(Y)
• Need a procedure that will Accept or Reject the hypothesis
• χ² test to assess deviance of the data from the hypothesis
• Alternatively, use mutual information between X and Y

## Strategy

• Define a scoring function for each candidate structure
• Search for a high scoring structure
## Key: choice of scoring function

• Likelihood-based scores
• Bayesian-based scores

## Goal: find (G,θ) that maximize the likelihood

• Score_L(G:D) = log P(D | G, θ̂G), where θ̂G is the MLE of the parameters for G
• Find G that maximizes Score_L(G:D)

## The likelihood score decomposes as:

• Score_L(G:D) = M·Σi I(Xi ; PaiG) − M·Σi H(Xi), where I and H are the empirical mutual information and entropy

• Second term does not depend on network structure and thus is irrelevant for selecting between two structures
• Score increases as the mutual information, i.e., the strength of dependence between connected variables, increases
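The decomposition can be checked numerically on a toy dataset: the decomposed form M·ΣI − M·ΣH equals the log-likelihood at the MLE. A sketch with illustrative helper functions (`log_likelihood`, `mi_entropy_score` are not from any library):

```python
import math
from collections import Counter

def log_likelihood(data, parents):
    """Log-likelihood of data under MLE parameters for structure
    `parents` (variable index -> tuple of parent indices)."""
    ll = 0.0
    for i, pa in parents.items():
        joint = Counter((tuple(row[j] for j in pa), row[i]) for row in data)
        margin = Counter(tuple(row[j] for j in pa) for row in data)
        for (u, x), c in joint.items():
            ll += c * math.log(c / margin[u])  # c * log of MLE P(x | u)
    return ll

def mi_entropy_score(data, parents):
    """Decomposed form: Score_L = M * sum_i I(Xi; Pa_i) - M * sum_i H(Xi)."""
    M = len(data)
    score = 0.0
    for i, pa in parents.items():
        px = Counter(row[i] for row in data)
        pu = Counter(tuple(row[j] for j in pa) for row in data)
        pxu = Counter((row[i], tuple(row[j] for j in pa)) for row in data)
        mi = sum(c / M * math.log((c / M) / ((px[x] / M) * (pu[u] / M)))
                 for (x, u), c in pxu.items())
        h = -sum(c / M * math.log(c / M) for c in px.values())
        score += M * mi - M * h
    return score

data = [(0, 0), (0, 1), (1, 1), (1, 1)]
parents = {0: (), 1: (0,)}  # structure X0 -> X1
print(abs(log_likelihood(data, parents) - mi_entropy_score(data, parents)) < 1e-9)
# True
```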
## After some manipulation can show:

• The score reflects to what extent the implied Markov assumptions are valid

## Limitations of the likelihood score

• The likelihood score never decreases when an edge is added, so it always favors the fully connected network and overfits the training data

## Solutions

• Restricting the hypotheses space
• Limits the overfitting capability of the learner
• Example: restrict # of parents or # of parameters
• Minimum description length
• Description length measures complexity
• Prefer models that compactly describe the training data
• Bayesian methods
• Average over all possible parameter values
• Use prior knowledge

## Bayesian score

• Score_B(G:D) = log P(D | G) + log P(G), where P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ

## Marginal likelihood of data given G

## Recall that for Dirichlet priors

• P(xm+1 = H | x1,…,xm) = (αH + MHm) / (α + m), where MHm is the number of heads in the first m examples

## Marginal likelihood: binomial case

• (Figure: actual experiment with P(H) = 0.25)

## Network structure determines form of marginal likelihood

• Example: P(Y = H | X = H) = 0.5 + p, P(Y = H | X = T) = 0.5 − p
• α(xi, paiG) are the hyperparameters for each family

## Structure prior P(G)

• Uniform prior: P(G) ∝ constant
• Prior penalizing number of edges: P(G) ∝ c^|G| (0 < c < 1)
• The normalizing constant is the same across networks and can thus be ignored

## Parameter prior P(θ|G)

• BDe prior
• M0: equivalent sample size
• B0: network representing the prior probability of events
• Set (xi,paiG) = M0 P(xi,paiG| B0)
• Note: paiG are not the same as parents of Xi in B0
• Compute P(xi,paiG| B0) using standard inference in B0
• BDe has the desirable property that I-equivalent networks have the same Bayesian score when using the BDe prior, for some M′ and P′

## Approximation is called the BIC score

• Score_BIC(G:D) = log P(D | G, θ̂G) − (log M / 2)·Dim(G), where Dim(G) is the number of independent parameters

• Score exhibits tradeoff between fit to data and complexity
• Mutual information grows linearly with M while complexity grows logarithmically with M
• As M grows, more emphasis is given to the fit to the data

## Bayesian score is consistent

• As M, the true structure G* maximizes the score
• Spurious edges will not contribute to likelihood and will be penalized
• Required edges will be added, since the likelihood term grows linearly in M while the complexity penalty grows only logarithmically

## BDe requires assessing a prior network

• Can naturally incorporate prior knowledge

## All are score-equivalent

• If G is I-equivalent to G′, then Score(G) = Score(G′)

## Input:

• Training data
• Scoring function (including priors, if needed)
• Set of possible structures
• Including prior knowledge about structure
## Output:

• A network (or networks) that maximize the score
## Key Property:

• Decomposability: the score of a network is a sum of terms

## Trees

• At most one parent per variable
## Why trees?

• Elegant math
• we can solve the optimization problem efficiently (with a greedy algorithm)
• Sparse parameterization
• avoid overfitting while adapting to the data

## Score = sum of edge scores + constant

## Algorithm

• Construct graph with vertices: 1,...n
• Set w(ij) = Score(Xj | Xi ) - Score(Xj)
• Find tree (or forest) with maximal weight
• This can be done using standard algorithms in low-order polynomial time by building a tree in a greedy fashion (Kruskal’s maximum spanning tree algorithm)
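The maximum-weight tree step can be sketched with Kruskal's algorithm over the edge weights; the weights below are hypothetical stand-ins for the empirical mutual informations:

```python
def max_spanning_tree(n, weights):
    """Kruskal sketch: pick edges in decreasing weight order, skipping
    any edge that would close a cycle (tracked with union-find)."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    tree = []
    for w, i, j in sorted(((w, i, j) for (i, j), w in weights.items()),
                          reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:          # joining two components: no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Hypothetical edge weights w(i-j), e.g. empirical mutual informations
weights = {(0, 1): 0.9, (1, 2): 0.5, (0, 2): 0.4, (2, 3): 0.7}
print(sorted(max_spanning_tree(4, weights)))
# [(0, 1), (1, 2), (2, 3)]
```

The weakest edge (0, 2) is rejected because it would close a cycle; the remaining edges form the maximum-weight tree.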

• When the score is likelihood, w(i→j) is proportional to I(Xi ; Xj). This is known as the Chow & Liu method

## Problem is not easy for more complex networks

• Example: Allowing two parents, greedy algorithm is no longer guaranteed to find the optimal network

## Theorem:

• Finding the maximal-scoring network structure with at most k parents per variable is NP-hard for k > 1
• If we bound the in-degree per variable by d, then complexity is exponential in d

## Define a search space:

• nodes are possible structures
• edges denote adjacency of structures

## Search techniques:

• Greedy hill-climbing
• Best first search
• Simulated Annealing
• ...

## Typical operations

• Single-edge changes: add, delete, or reverse an edge

## Caching

• To update the score after a local change, we only need to rescore the families that were changed in the last move

## Simplest heuristic local search

• Start with: an empty network, the best tree, or a random network
• At each iteration, apply the single-edge change that most improves the score
• Stop when no modification improves the score
• Each step requires evaluating O(n) new changes

## Greedy Hill-Climbing can get stuck in:

• Local Maxima
• All one-edge changes reduce the score
• Plateaus
• Some one-edge changes leave the score unchanged
• Happens because equivalent networks receive the same score and are neighbors in the search space
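The greedy loop can be sketched as follows, with a toy score function standing in for a real decomposable score; edge reversals and acyclicity checks are omitted for brevity:

```python
from itertools import permutations

def hill_climb(n, score):
    """Greedy hill-climbing sketch over single-edge additions and
    deletions; `score` is a hypothetical scoring function over edge
    sets. Reversals and acyclicity checks are omitted."""
    edges = set()                      # start from the empty network
    while True:
        best, best_gain = None, 0.0
        for e in permutations(range(n), 2):
            cand = edges ^ {e}         # toggle: add if absent, delete if present
            gain = score(cand) - score(edges)
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:               # local maximum (or plateau): stop
            return edges
        edges = best

# Toy score that rewards exactly the edges of a "true" chain 0 -> 1 -> 2
true_edges = {(0, 1), (1, 2)}
score = lambda es: len(es & true_edges) - len(es - true_edges)
print(sorted(hill_climb(3, score)))
# [(0, 1), (1, 2)]
```

With this toy score the search has no local maxima, so greedy moves reach the true chain; real score surfaces are exactly where the local maxima and plateaus above bite.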

## Standard heuristics can escape both

• Random restarts
• TABU search

## Idea

• Search the space of equivalence classes
• Equivalence classes can be represented by PDAGs (partially directed acyclic graphs)

• PDAGs space has fewer local maxima and plateaus
• There are fewer PDAGs than DAGs • ## Evaluating changes is more expensive

• In addition to search, need to score a consistent network
• These algorithms are more complex to implement

## Learning Example: Alarm Network

## So far, we focused on a single model

• Find best scoring model
• Use it to predict next example

## Pros:

• We get a single structure
• Allows for efficient use in our tasks
## Cons:

• We are committing to the independencies of a particular structure
• Other structures might be as probable given the data

## Density estimation

• One structure may suffice, if its joint distribution is similar to other high scoring structures
## Structure discovery

• Define features f(G) (e.g., edge, sub-structure, d-sep query)
• Compute P(f | D) = ΣG f(G) P(G | D)
• Still requires summing over exponentially many structures

## Assumptions

• Known total ordering ≺ of the variables
• Maximum in-degree d for each variable

## Marginal likelihood

## Posterior probability of a particular edge choice

## Solution: sample from the posterior distribution P(G|D)

• Estimate the feature probability by P(f | D) ≈ (1/K) Σk f(Gk), where G1,…,GK are sampled structures
• Sampling can be done by MCMC
• Sampling can be done over orderings

## Define score with local structures

• Example: in tree CPDs, score decomposes by leaves
## Prior may need to be extended

• Example: in tree CPDs, penalty for tree structure per CPD
## Extend search operators to local structure

• Example: in tree CPDs, we need to search for tree structure
• Can be done by local encapsulated search or by defining new global operations

## In general, NP-Hard

• Need to resort to heuristic search
• In practice, search is relatively fast (~100 vars in ~10 min):
• Decomposability
• Sufficient statistics
## In some cases, we can reduce the search problem to an easy optimization problem

• Example: learning trees