LASSO BBN
This project learns Bayesian Belief Network (BBN) structures using LASSO regression. The project is a work in progress, and where appropriate, we have noted work that remains to be done (e.g. TODO items).
You may install lassobbn from PyPI.
pip install lassobbn
Quickstart
Here is a quickstart example. There are four basic steps.
Learn the structure.
Learn the parameters.
Convert the structure and parameters into a Bayesian Belief Network (BBN).
Convert the BBN into a Join Tree (JT) for exact inference.
from lassobbn.learn import learn_parameters, learn_structure, to_bbn, to_join_tree, posteriors_to_df

# Step 1. Learn the structure
df_path = './data/data-binary.csv'
meta_path = './data/data-binary-complete.json'

parents = learn_structure(df_path, meta_path, n_way=2, ignore_neg_gt=-0.01, ignore_pos_lt=0.05)
print('parents')
print(parents)
print('-' * 15)
# {'e': ['d!b'], 'd': ['b!a']}

# Step 2. Learn the parameters
d, g, p = learn_parameters(df_path, parents)
print('domains')
print(d)
print('-' * 15)
# {'d!b': ['0', '1'], 'e': ['0', '1'], 'd': ['0', '1'], 'b': ['0', '1'], 'b!a': ['0', '1'], 'a': ['0', '1']}

print('structure')
for pa, ch in g.edges():
    print(f'{pa} -> {ch}')
print('-' * 15)
# d!b -> e
# d -> d!b
# b -> d!b
# b -> b!a
# b!a -> d
# a -> b!a

print('parameters')
for k, arr in p.items():
    probs = [f'{v:.2f}' for v in arr]
    probs = ', '.join(probs)
    print(f'{k}: [{probs}]')
print('-' * 15)
# d!b: [1.00, 0.00, 1.00, 0.00, 1.00, 0.00, 0.00, 1.00]
# e: [0.77, 0.23, 0.08, 0.92]
# d: [0.79, 0.21, 0.80, 0.20]
# b: [0.80, 0.20]
# b!a: [1.00, 0.00, 1.00, 0.00, 1.00, 0.00, 0.00, 1.00]
# a: [0.19, 0.81]

# Step 3. Get the BBN
bbn = to_bbn(d, g, p)

# Step 4. Get the Join Tree
jt = to_join_tree(bbn)

print('bbn')
print(bbn)
print('-' * 15)
# 0|d!b|0,1
# 1|e|0,1
# 2|d|0,1
# 3|b|0,1
# 4|b!a|0,1
# 5|a|0,1
# 0->1
# 2->0
# 3->0
# 3->4
# 4->2
# 5->4

print('join tree')
print(jt)
print('-' * 15)
# (d!b,e)
# (b,d,d!b)
# (b,b!a,d)
# (a,b,b!a)
# |(b,d,d!b) -- d,b -- (b,b!a,d)|
# |(b,b!a,d) -- b,b!a -- (a,b,b!a)|
# |(d!b,e) -- d!b -- (b,d,d!b)|
# (b,d,d!b)--|(b,d,d!b) -- d,b -- (b,b!a,d)|--(b,b!a,d)
# (b,b!a,d)--|(b,b!a,d) -- b,b!a -- (a,b,b!a)|--(a,b,b!a)
# (d!b,e)--|(d!b,e) -- d!b -- (b,d,d!b)|--(b,d,d!b)

# Get posteriors
print('posteriors')
mdf = posteriors_to_df(jt)
print(mdf)

# should print
#             0         1
# name
# d!b  0.960997  0.039003
# e    0.740779  0.259221
# d    0.795200  0.204800
# b    0.802900  0.197100
# b!a  0.840211  0.159789
# a    0.189300  0.810700
Data
Your data should be in comma-separated values (CSV) format. All your data should be binary in nature, with values of 0 or 1. Here is an example of the CSV data you will need.
a,b,c,d,e
1,0,0,0,0
1,0,0,0,0
1,0,0,1,1
0,0,0,0,1
0,0,0,0,0
1,0,0,0,1
1,0,0,0,0
1,0,0,1,1
0,0,0,0,1
1,0,0,0,0
Notice that the first line contains the headers, which are the names of the variables. In this example file, there are 5 variables: a, b, c, d, e. Also note that there is no missing data. This CSV file should be easily read by Pandas using pd.read_csv(...).
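Before handing a file to the learning procedure, it can be worth verifying that the data really is binary. Here is a minimal, stdlib-only sanity check; the inline sample stands in for a real CSV file such as ./data/data-binary.csv:

```python
import csv
import io

# Inline sample standing in for the contents of a real CSV file.
raw = """a,b,c,d,e
1,0,0,0,0
1,0,0,1,1
0,0,0,0,1
"""

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

# Every value must be the string '0' or '1'; collect any offenders.
bad = [(i, k, v) for i, row in enumerate(rows)
       for k, v in row.items() if v not in ('0', '1')]
assert not bad, f'non-binary values found: {bad}'
print('ok:', len(rows), 'rows,', len(reader.fieldnames), 'variables')
```

A check like this catches stray blanks or non-binary codes before they reach the regression step.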
TODO
D00: In the future, we will enable other types of variables such as continuous and general categorical variables.
Meta Information
Meta information is information that will help guide the learning procedure. The learning procedure will consider the following.
The ordering of the variables. For now, both complete and partial orderings are allowed.
A list of edges to blacklist. These are edges that will never be allowed even if they are found.
A list of edges to whitelist. These are edges that will always be created even if they are not found.
The meta information you provide should be stored in JSON format. Below is an example of meta information that defines a complete ordering. Look at the key ordering and its associated value. The value is a list of lists (sub-lists), or a nested list. The sequence of the variables is stored inside these sub-lists. Here, we have 5 sub-lists, and each sub-list contains only a single element. This ordering implies that a comes before b, b comes before c, and so on. The ordering is a complete ordering since no sub-list has more than one element. Each sub-list is a level of the sequence: variables in earlier sub-lists occur before those in later ones.
{
    "ordering": [
        ["a"],
        ["b"],
        ["c"],
        ["d"],
        ["e"]
    ]
}
Take a look at this next ordering. It is a partial ordering since at least one sub-list has more than one element. In particular, this ordering is not complete since we do not know whether a comes before b or vice-versa; we have incomplete knowledge. Thus, we specify a and b to be at the same level of the sequence. Variables at the same level will never be considered as dependent variables of one another (since we do not know their ordering). The meaning of this ordering is that a and b come before c and d, and c and d come before e.
{
    "ordering": [
        ["a", "b"],
        ["c", "d"],
        ["e"]
    ]
}
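To make the semantics concrete, here is a small sketch (not the library's internals) of how such an ordering constrains candidate parents: a parent must sit at a strictly earlier level than its child, and same-level variables are never considered as each other's parents.

```python
# The partial ordering from the JSON example above.
ordering = [["a", "b"], ["c", "d"], ["e"]]

# Map each variable to the index of the sub-list (level) it belongs to.
level = {v: i for i, vs in enumerate(ordering) for v in vs}

def allowed_parent(parent, child):
    """A parent is only considered if it sits at a strictly earlier level."""
    return level[parent] < level[child]

assert allowed_parent('a', 'c')      # earlier level: allowed
assert not allowed_parent('a', 'b')  # same level: never considered
assert not allowed_parent('e', 'c')  # later level: disallowed
```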
Learning
Learning a Bayesian Belief Network (BBN) means learning the structure and the parameters. The structure of a BBN is typically learned first, and the parameters are learned afterwards. The signature of the learn_structure(...) method is as follows.
learn_structure(df_path: str, meta_path: str, n_way=3,
ignore_neg_gt=-0.1, ignore_pos_lt=0.1,
n_regressions=10, solver='liblinear', penalty='l1', C=0.2,
robust_threshold=0.9) -> Dict[str, List[str]]
Since we are using logistic regression with LASSO regularization, you will need to specify how to accomplish the regression with some arguments. The solver can be either liblinear or saga. The penalty must be l1, and the regularization strength C is a number in [0, 1]; a smaller value of C means stronger regularization. Please take a look at scikit-learn's official documentation for additional information.
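To see why a smaller C means stronger regularization, consider the generic L1 soft-thresholding (shrinkage) operator. This is the textbook lasso proximal step, shown purely as an illustration; it is not lassobbn's or scikit-learn's actual implementation.

```python
def soft_threshold(beta, lam):
    """L1 (lasso) shrinkage: pull a coefficient toward zero by lam,
    clamping it to exactly zero when its magnitude is below lam."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

# In scikit-learn, the effective penalty weight scales like 1/C, so a
# smaller C behaves like a larger lam here (values below are illustrative).
weak, strong = 0.5, 2.0
print(soft_threshold(0.8, weak))    # survives, shrunk toward zero
print(soft_threshold(0.8, strong))  # eliminated entirely
```

This is the mechanism that lets LASSO act as a variable selector: coefficients of weak predictors are driven to exactly zero.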
What is returned is a Python dictionary that stores the child to parent relationships. Here is an example of the dictionary that is returned.
{
    "e": ["d!b"],
    "d": ["b!a"]
}
After you have learned the child to parent relationships (or equivalently, parent to child relationships), you should then learn the parameters. The signature of the learn_parameters(...)
function is as follows.
learn_parameters(df_path: str, pas: Dict[str, List[str]]) -> \
Tuple[Dict[str, List[str]], nx.DiGraph, Dict[str, List[float]]]
The output of learn_parameters(...)
is a tuple of 3 things.
domains of each variable
graphical structure
conditional probability tables for each variable
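The conditional probability tables are returned as flat lists. Assuming they are laid out row-major over the parent configurations, as the Quickstart output suggests, the list for e can be read as follows:

```python
from itertools import product

# Values copied from the Quickstart output: e has the single parent d!b.
domains = {'d!b': ['0', '1'], 'e': ['0', '1']}
parents = {'e': ['d!b']}
p_e = [0.77, 0.23, 0.08, 0.92]

# Assumed layout: one row [P(e=0|...), P(e=1|...)] per parent configuration,
# with configurations enumerated in row-major (itertools.product) order.
n = len(domains['e'])
rows = [p_e[i:i + n] for i in range(0, len(p_e), n)]
for config, row in zip(product(*(domains[pa] for pa in parents['e'])), rows):
    print(f"P(e | d!b={config[0]}) = {row}")
# P(e | d!b=0) = [0.77, 0.23]
# P(e | d!b=1) = [0.08, 0.92]
```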
TODO
L00: Implement LASSO regression with continuous dependent variable.
L01: Implement LASSO regression with categorical independent variable.
L02: How do we implement LASSO regression with categorical dependent variable?
L03: How do we learn with partial ordering of the variables? (DONE)
L04: How do we learn with no ordering of the variables?
L05: Implement blacklisted or whitelisted edges.
Inference
After you have learned the structure and parameters of the BBN, you can use Py-BBN to perform inference. First, create an instance of a BBN, and then use that BBN instance to create an instance of a Junction Tree (JT). Py-BBN is open source and may be installed from PyPI. This library already lists Py-BBN as a requirement, so installing this library will also install Py-BBN. The methods you need to pay attention to are as follows.
to_bbn(d, g, p): uses the domain information d, structure g, and parameters p to create a Bayesian Belief Network (BBN)
to_join_tree(bbn): converts a BBN to a Join Tree (JT)
posteriors_to_df(jt): gets the posterior information as a data frame
# Step 3. Get the BBN
bbn = to_bbn(d, g, p)

# Step 4. Get the Join Tree
jt = to_join_tree(bbn)

print('bbn')
print(bbn)
print('-' * 15)
# 0|d!b|0,1
# 1|e|0,1
# 2|d|0,1
# 3|b|0,1
# 4|b!a|0,1
# 5|a|0,1
# 0->1
# 2->0
# 3->0
# 3->4
# 4->2
# 5->4

print('join tree')
print(jt)
print('-' * 15)
# (d!b,e)
# (b,d,d!b)
# (b,b!a,d)
# (a,b,b!a)
# |(b,d,d!b) -- d,b -- (b,b!a,d)|
# |(b,b!a,d) -- b,b!a -- (a,b,b!a)|
# |(d!b,e) -- d!b -- (b,d,d!b)|
# (b,d,d!b)--|(b,d,d!b) -- d,b -- (b,b!a,d)|--(b,b!a,d)
# (b,b!a,d)--|(b,b!a,d) -- b,b!a -- (a,b,b!a)|--(a,b,b!a)
# (d!b,e)--|(d!b,e) -- d!b -- (b,d,d!b)|--(b,d,d!b)

# Get posteriors
print('posteriors')
mdf = posteriors_to_df(jt)
print(mdf)

# should print
#             0         1
# name
# d!b  0.960997  0.039003
# e    0.740779  0.259221
# d    0.795200  0.204800
# b    0.802900  0.197100
# b!a  0.840211  0.159789
# a    0.189300  0.810700
Algorithm
Structure learning of causal Bayesian Belief Networks (BBNs) using regression and sequence information has been reported [Ale20a, Ale20b]. In this section, we take a less formal approach to explaining the structure learning algorithm. The algorithm is best understood when a complete ordering of the variables is given. Assume we have a causal BBN for which we know the true structure and parameters, shown in the figure below, with all variables binary.
Now, let’s say we have observed data from this causal BBN, and a sample of the data looks as below.
a,b,c,d,e
1,0,0,0,0
1,0,0,0,0
1,0,0,1,1
0,0,0,0,1
0,0,0,0,0
1,0,0,0,1
1,0,0,0,0
1,0,0,1,1
0,0,0,0,1
1,0,0,0,0
If a user can correctly specify the order of the variables, indicating which variable occurs before which others, then we can induce/learn a causal BBN structure from the data. Let's say a user specifies the order to be a, b, c, d, e. Note that even though a does not actually come before b (or vice-versa), that is okay: they are tied, and we just need an ordering.
The structure learning algorithm iterates over each variable as a dependent variable, regressing it on all the variables that come before it. Since there are 5 variables, there are at most 5 regression equations to run; since a is the first variable and nothing precedes it, we will only run 4 regression equations.
\(e = a + b + c + d\)
\(d = a + b + c\)
\(c = a + b\)
\(b = a\)
We eliminate independent variables that are not parents of the dependent variable by exploiting two facts: the sequence implies time dependency, and the coefficient associated with each independent variable indicates prediction strength. LASSO regularization forces coefficients to zero, so we expect each model to have non-zero coefficients only for those independent variables that are parents of the dependent variable. The following table lists the coefficients of each independent variable (columns) for each model, where the row denotes the dependent variable.
| Dependent Variable | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 0 | 0 | 0 | 0 |
| b | 0 | 0 | 0 | 0 | 0 |
| c | 0.8 | 0.3 | 0 | 0 | |
| d | 0 | 0 | 0 | 0 | 0 |
| e | 0 | 0 | 0.3 | 0.8 | 0 |
You can see that the regression models for a, b, and d yield no parents. The model with c as the dependent variable suggests that a and b are its parents, and the model with e as the dependent variable suggests that c and d are its parents.
With the sequence to help us build the models, and with LASSO regularization, we can now induce parent-child relationships between each dependent variable and the independent variables with non-zero coefficients (non-zero in absolute value). We proceed through the models from e to a (as the dependent variable), drawing the arcs between parent and child one at a time; wherever an arc would form a cycle, we skip it.
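The arc-drawing step can be sketched in a few lines of pure Python. This mirrors the prose using the non-zero coefficients from the table, with a simple reachability check to skip cycle-forming arcs; it is an illustration, not lassobbn's implementation.

```python
# child -> {candidate parent: non-zero coefficient}, from the table above.
coefs = {
    'c': {'a': 0.8, 'b': 0.3},
    'e': {'c': 0.3, 'd': 0.8},
}

def reaches(graph, src, dst):
    """Depth-first check: is dst reachable from src along existing arcs?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

edges = {}  # parent -> list of children
for child in ['e', 'c']:  # proceed from e back toward a, as described
    for parent, w in coefs.get(child, {}).items():
        # Skip any arc that would form a cycle (child already reaches parent).
        if abs(w) > 0 and not reaches(edges, child, parent):
            edges.setdefault(parent, []).append(child)

print(edges)  # {'c': ['e'], 'd': ['e'], 'a': ['c'], 'b': ['c']}
```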
API Documentation
Learn
- lassobbn.learn.do_learn(df_path: str, nodes: List[str], seen: Dict[str, List[str]], ordering_map: Dict[str, List[str]], n_way=3, ignore_neg_gt=-0.1, ignore_pos_lt=0.1, n_regressions=10, solver='liblinear', penalty='l1', C=0.2, robust_threshold=0.9) -> None
Recursively learns parents or robust independent variables associated with each variable.
- Parameters
df_path – CSV path.
nodes – List of variables.
seen – Dictionary storing processed/seen variables.
ordering_map – Ordering map.
n_way – Number of n-way interactions. Default is 3.
ignore_neg_gt – Threshold for ignoring negative coefficients.
ignore_pos_lt – Threshold for ignoring positive coefficients.
n_regressions – The number of regressions to do. Default is 10.
solver – Solver. Default is liblinear.
penalty – Penalty. Default is l1.
C – Regularization strength. Default is 0.2.
robust_threshold – Robustness threshold. Default is 0.9.
- Returns
None.
- lassobbn.learn.do_regression(X_cols: List[str], y_col: str, df: pandas.core.frame.DataFrame, solver='liblinear', penalty='l1', C=0.2) -> sklearn.linear_model._logistic.LogisticRegression
Performs regression.
- Parameters
X_cols – Independent variables.
y_col – Dependent variable.
df – Data frame.
solver – Solver. Default is liblinear.
penalty – Penalty. Default is l1.
C – Regularization strength. Default is 0.2.
- Returns
Logistic regression model.
- lassobbn.learn.do_robust_regression(X_cols: List[str], y_col: str, df_path: str, n_way=3, ignore_neg_gt=-0.1, ignore_pos_lt=0.1, n_regressions=10, solver='liblinear', penalty='l1', C=0.2, robust_threshold=0.9) -> Dict[str, Union[str, List]]
Performs robust regression.
- Parameters
X_cols – List of independent variables.
y_col – Dependent variable.
df_path – Path of CSV file.
n_way – Number of n-way interactions. Default is 3.
ignore_neg_gt – Threshold for ignoring negative coefficients.
ignore_pos_lt – Threshold for ignoring positive coefficients.
n_regressions – The number of regressions to do. Default is 10.
solver – Solver. Default is liblinear.
penalty – Penalty. Default is l1.
C – Regularization strength. Default is 0.2.
robust_threshold – Robustness threshold. Default is 0.9.
- Returns
A dictionary storing parents of a child. The parents are said to be robust.
- lassobbn.learn.expand_data(df_path: str, parents: Dict[str, List[str]]) -> pandas.core.frame.DataFrame
Expands data with additional columns defined by parent-child relationships.
- Parameters
df_path – CSV path.
parents – Parent-child relationships.
- Returns
Data frame.
- lassobbn.learn.extract_meta(meta_path: str) -> Tuple[Dict[str, List[str]], List[str]]
Extracts metadata.
- Parameters
meta_path – Metadata path (JSON file).
- Returns
Tuple; (ordering map, start nodes).
- lassobbn.learn.extract_model_params(independent_cols: List[str], y_col: str, model: sklearn.linear_model._logistic.LogisticRegression) -> Dict[str, Union[str, float]]
Extracts parameters from models (e.g. coefficients).
- Parameters
independent_cols – List of independent variables.
y_col – Dependent variable.
model – Logistic regression model.
- Returns
Parameters (e.g. coefficients of each independent variable).
- lassobbn.learn.get_data(df_path: str, X_cols: List[str], y_col: str, n_way=3) -> pandas.core.frame.DataFrame
Gets a data frame with additional columns representing the n-way interactions.
- Parameters
df_path – Path to CSV file.
X_cols – List of variables.
y_col – The dependent variable.
n_way – Number of n-way interactions. Default is 3.
- Returns
Data frame.
- lassobbn.learn.get_graph(parents: Dict[str, List[str]]) -> networkx.classes.digraph.DiGraph
Gets a graph (nx.DiGraph).
- Parameters
parents – Dictionary; keys are children, values are list of parents.
- Returns
Graph.
- lassobbn.learn.get_n_way(X_cols: List[str], n_way=3) -> List[Tuple[str, ...]]
Gets up to all n-way interactions.
- Parameters
X_cols – List of variables.
n_way – Maximum n-way interactions. Default is 3.
- Returns
List of n-way interactions.
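As an illustration of what such interaction terms look like, here is a hypothetical sketch (sketch_n_way is not the library's function) that enumerates interaction tuples of size 2 up to n_way using itertools.combinations:

```python
from itertools import combinations

def sketch_n_way(X_cols, n_way=3):
    """Hypothetical illustration of n-way interaction terms: all variable
    tuples of size 2 up to n_way. Not lassobbn's actual implementation."""
    terms = []
    for k in range(2, n_way + 1):
        terms.extend(combinations(X_cols, k))
    return terms

print(sketch_n_way(['a', 'b', 'c'], n_way=2))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(sketch_n_way(['a', 'b', 'c'], n_way=3))
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('a', 'b', 'c')]
```

For binary data, each tuple would correspond to a product column such as the d!b and b!a compound variables seen in the Quickstart.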
- lassobbn.learn.get_ordering_map(meta: Dict[str, any]) -> Dict[str, List[str]]
Gets a dictionary specifying the ordering. A key is a variable; its value is the list of variables that come before it.
- Parameters
meta – Metadata.
- Returns
Ordering.
- lassobbn.learn.get_robust_stats(robust: pandas.core.frame.DataFrame, robust_threshold=0.9) -> pandas.core.frame.DataFrame
Computes the robustness statistics.
- Parameters
robust – Data frame of robustness indicators.
robust_threshold – Threshold for robustness. Default is 0.9.
- Returns
Data frame of variables that are robust.
- lassobbn.learn.get_start_nodes(meta: Dict[str, any]) -> List[str]
Gets a list of start variables/nodes to kick off the algorithm.
- Parameters
meta – Metadata.
- Returns
Start nodes.
- lassobbn.learn.learn_parameters(df_path: str, pas: Dict[str, List[str]]) -> Tuple[Dict[str, List[str]], networkx.classes.digraph.DiGraph, Dict[str, List[float]]]
Gets the parameters.
- Parameters
df_path – CSV file.
pas – Parent-child relationships (structure).
- Returns
Tuple; first item is dictionary of domains; second item is a graph; third item is dictionary of probabilities.
- lassobbn.learn.learn_structure(df_path: str, meta_path: str, n_way=3, ignore_neg_gt=-0.1, ignore_pos_lt=0.1, n_regressions=10, solver='liblinear', penalty='l1', C=0.2, robust_threshold=0.9) -> Dict[str, List[str]]
Kicks off the learning process.
- Parameters
df_path – CSV path.
meta_path – Metadata path.
n_way – Number of n-way interactions. Default is 3.
ignore_neg_gt – Threshold for ignoring negative coefficients.
ignore_pos_lt – Threshold for ignoring positive coefficients.
n_regressions – The number of regressions to do. Default is 10.
solver – Solver. Default is liblinear.
penalty – Penalty. Default is l1.
C – Regularization strength. Default is 0.2.
robust_threshold – Robustness threshold. Default is 0.9.
- Returns
Dictionary where keys are children and values are list of parents.
- lassobbn.learn.posteriors_to_df(jt: pybbn.graph.jointree.JoinTree) -> pandas.core.frame.DataFrame
Converts posteriors to data frame.
- Parameters
jt – Join tree.
- Returns
Data frame.
- lassobbn.learn.to_bbn(d: Dict[str, List[str]], s: networkx.classes.digraph.DiGraph, p: Dict[str, List[float]]) -> pybbn.graph.dag.Bbn
Converts the structure and parameters to a BBN.
- Parameters
d – Domain of each variable.
s – Structure.
p – Parameter.
- Returns
BBN.
- lassobbn.learn.to_join_tree(bbn: pybbn.graph.dag.Bbn) -> pybbn.graph.jointree.JoinTree
Converts a BBN to a Join Tree.
- Parameters
bbn – BBN.
- Returns
Join Tree.
- lassobbn.learn.to_robustness_indication(params: pandas.core.frame.DataFrame, ignore_neg_gt=-0.1, ignore_pos_lt=0.1) -> pandas.core.frame.DataFrame
Checks if each coefficient value is "robust". A negative coefficient is NOT robust if it is greater than ignore_neg_gt, and a positive coefficient is NOT robust if it is less than ignore_pos_lt.
- Parameters
params – Data frame of parameters.
ignore_neg_gt – Threshold for negative coefficients. Default is -0.1.
ignore_pos_lt – Threshold for positive coefficients. Default is 0.1.
- Returns
Data frame (all 1’s and 0’s) indicating robustness.
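The thresholding rule can be sketched as follows. Here is_robust is a hypothetical helper illustrating the rule for a single coefficient, whereas the library operates on a whole data frame:

```python
def is_robust(coef, ignore_neg_gt=-0.1, ignore_pos_lt=0.1):
    """Illustration of the robustness indication (not the library's code):
    a coefficient counts as robust only if it is at most ignore_neg_gt
    (strongly negative) or at least ignore_pos_lt (strongly positive)."""
    return 1 if (coef <= ignore_neg_gt or coef >= ignore_pos_lt) else 0

coefs = [-0.25, -0.05, 0.0, 0.05, 0.3]
print([is_robust(c) for c in coefs])  # [1, 0, 0, 0, 1]
```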
- lassobbn.learn.trim_parents(parents: List[str]) -> List[str]
Prunes or trims down the list of parents. There might be duplicates as a result of compound or n-way interactions.
- Parameters
parents – List of parents.
- Returns
List of (pruned/trimmed) parents.
- lassobbn.learn.trim_relationships(rels: Dict[str, List[str]]) -> Dict[str, List[str]]
Trims/prunes parent-child relationships.
- Parameters
rels – Dictionary of parent-child relationships.
- Returns
Dictionary of trimmed parent-child relationships.
Other APIs
If you like lassobbn, you might be interested in these other projects.
Py-BBN
pybbn is an open-source Bayesian Belief Network project for causal and exact inference!
Turing BBN
turing_bbn is a C++17 implementation of py-bbn; take your causal and probabilistic inferences to the next computing level!
PySpark BBN
pyspark-bbn is a scalable, massively parallel processing (MPP) framework for learning the structures and parameters of Bayesian Belief Networks (BBNs) using Apache Spark.
Bibliography
- Ale20a
F. Alemi. Constructing causal networks through regressions: a tutorial. Quality Management Health Care, 29(2):270–278, 2020.
- Ale20b
F. Alemi. Worry less about the algorithm, more about the sequence of events. Mathematical Biosciences and Engineering, 17(6):6557–6572, 2020.
Copyright
Software
Copyright 2021 Farrokh Alemi and Jee Vang
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Art
Copyright 2021 Daytchia Vang
Citation
@misc{alemi_2021,
title={lasso-bbn},
url={https://lasso-bbn.readthedocs.io/},
author={F. Alemi and J. Vang},
year={2021},
month={Aug}}
Acknowledgement
This software was funded by Department of Health Administration and Policy (HAP), under the College of Health and Human Services (CHHS) at George Mason University (GMU). The contents are those of the author(s) and do not necessarily represent the official views of, nor an endorsement, by HAP, CHHS, or GMU.