# LASSO BBN This project learns Bayesian Belief Network structures using LASSO regression. The project is work-in-progress and where appropriate, we have stated work that remains to be done (e.g. TODO).

You may install lassobbn from PyPI.

    pip install lassobbn


## Quickstart

Here is a quickstart example. There are four steps.

1. Learn the structure.

2. Learn the parameters.

3. Convert the structure and parameters into a Bayesian Belief Network (BBN).

4. Convert the BBN into a Join Tree (JT) for exact inference.

    from lassobbn.learn import learn_parameters, learn_structure, to_bbn, to_join_tree, posteriors_to_df

    # Step 1. Learn the structure
    df_path = './data/data-binary.csv'
    meta_path = './data/data-binary-complete.json'

    parents = learn_structure(df_path, meta_path, n_way=2, ignore_neg_gt=-0.01, ignore_pos_lt=0.05)
    print('parents')
    print(parents)
    print('-' * 15)
    # {'e': ['d!b'], 'd': ['b!a']}

    # Step 2. Learn the parameters
    d, g, p = learn_parameters(df_path, parents)
    print('domains')
    print(d)
    print('-' * 15)
    # {'d!b': ['0', '1'], 'e': ['0', '1'], 'd': ['0', '1'], 'b': ['0', '1'], 'b!a': ['0', '1'], 'a': ['0', '1']}

    print('structure')
    for pa, ch in g.edges():
        print(f'{pa} -> {ch}')
    print('-' * 15)
    # d!b -> e
    # d -> d!b
    # b -> d!b
    # b -> b!a
    # b!a -> d
    # a -> b!a

    print('parameters')
    for k, arr in p.items():
        probs = [f'{v:.2f}' for v in arr]
        probs = ', '.join(probs)
        print(f'{k}: [{probs}]')
    print('-' * 15)
    # d!b: [1.00, 0.00, 1.00, 0.00, 1.00, 0.00, 0.00, 1.00]
    # e: [0.77, 0.23, 0.08, 0.92]
    # d: [0.79, 0.21, 0.80, 0.20]
    # b: [0.80, 0.20]
    # b!a: [1.00, 0.00, 1.00, 0.00, 1.00, 0.00, 0.00, 1.00]
    # a: [0.19, 0.81]

    # Step 3. Get the BBN
    bbn = to_bbn(d, g, p)

    # Step 4. Get the Join Tree
    jt = to_join_tree(bbn)

    print('bbn')
    print(bbn)
    print('-' * 15)
    # 0|d!b|0,1
    # 1|e|0,1
    # 2|d|0,1
    # 3|b|0,1
    # 4|b!a|0,1
    # 5|a|0,1
    # 0->1
    # 2->0
    # 3->0
    # 3->4
    # 4->2
    # 5->4

    print('join tree')
    print(jt)
    print('-' * 15)
    # (d!b,e)
    # (b,d,d!b)
    # (b,b!a,d)
    # (a,b,b!a)
    # |(b,d,d!b) -- d,b -- (b,b!a,d)|
    # |(b,b!a,d) -- b,b!a -- (a,b,b!a)|
    # |(d!b,e) -- d!b -- (b,d,d!b)|
    # (b,d,d!b)--|(b,d,d!b) -- d,b -- (b,b!a,d)|--(b,b!a,d)
    # (b,b!a,d)--|(b,b!a,d) -- b,b!a -- (a,b,b!a)|--(a,b,b!a)
    # (d!b,e)--|(d!b,e) -- d!b -- (b,d,d!b)|--(b,d,d!b)

    # Get posteriors
    print('posteriors')
    mdf = posteriors_to_df(jt)
    print(mdf)

    # should print
    #              0         1
    # name
    # d!b   0.960997  0.039003
    # e     0.740779  0.259221
    # d     0.795200  0.204800
    # b     0.802900  0.197100
    # b!a   0.840211  0.159789
    # a     0.189300  0.810700


## Data

Your data should be in comma-separated values (CSV) format. All of your data should be binary, with values of 0 or 1. Here is an example of the CSV data you will need.

    a,b,c,d,e
    1,0,0,0,0
    1,0,0,0,0
    1,0,0,1,1
    0,0,0,0,1
    0,0,0,0,0
    1,0,0,0,1
    1,0,0,0,0
    1,0,0,1,1
    0,0,0,0,1
    1,0,0,0,0


Notice that the first line contains the headers, which are the names of the variables. In this example file, there are 5 variables: a, b, c, d, e. Also note that there are no missing data. This CSV file should be easily read by Pandas using pd.read_csv(...).
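The requirements above (a header row, only 0/1 values, nothing missing) can be checked before learning. The following is a minimal sketch using only the standard library; the helper name `validate_binary_csv` is hypothetical and is not part of lassobbn.

```python
import csv
import io


def validate_binary_csv(handle):
    """Return the header if every cell is '0' or '1'; raise ValueError otherwise."""
    reader = csv.reader(handle)
    header = next(reader)
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(f'line {line_no}: expected {len(header)} values, got {len(row)}')
        for name, value in zip(header, row):
            # An empty string (missing value) is rejected here as well.
            if value not in ('0', '1'):
                raise ValueError(f'line {line_no}: column {name!r} has non-binary value {value!r}')
    return header


# A few rows in the same shape as the example file above.
sample = io.StringIO('a,b,c,d,e\n1,0,0,0,0\n1,0,0,1,1\n0,0,0,0,1\n')
print(validate_binary_csv(sample))
```

For a file on disk, pass an open file handle instead of the `io.StringIO` object.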

TODO

• D00: In the future, we will enable other types of variables such as continuous and general categorical variables.

## Meta Information

Meta information is information that will help guide the learning procedure. The learning procedure will consider the following.

• The ordering of the variables. For now, complete and partial orderings are allowed.

• A list of edges to blacklist. These are edges that will never be allowed even if they are found.

• A list of edges to whitelist. These are edges that will always be created even if they are not found.

The meta information you provide should be stored in a JSON file. Below is an example of meta information stored in a JSON file that defines a complete ordering. Look at the key ordering and its associated value. The value is a list of lists (sub-lists), that is, a nested list, and the sequence of the variables is stored inside these sub-lists. Each sub-list is a level of the sequence: variables in earlier sub-lists occur before those in later ones.

Here, we have 5 sub-lists, each with a single element. This ordering implies that a comes before b, b comes before c, and so on. It is a complete ordering since no sub-list has more than one element.

    {
      "ordering": [
        ["a"],
        ["b"],
        ["c"],
        ["d"],
        ["e"]
      ]
    }


Take a look at this next ordering. It is a partial ordering since at least one sub-list has more than one element. In particular, it is not complete because we do not know whether a comes before b or vice versa; we have incomplete knowledge. Thus, we specify a and b to be at the same level of the sequence. Variables at the same level will never be considered as dependent variables of one another (since we do not know their ordering). This ordering means that a and b come before c and d, and c and d come before e.

    {
      "ordering": [
        ["a", "b"],
        ["c", "d"],
        ["e"]
      ]
    }
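To make the meaning of the levels concrete, the sketch below turns an `ordering` value into a map from each variable to the variables that precede it. Variables at the same level do not appear in each other's lists. The function name `ordering_to_predecessors` is hypothetical; this illustrates the idea and is not the library's own implementation.

```python
import json


def ordering_to_predecessors(meta):
    """Map each variable to the variables in strictly earlier levels."""
    predecessors = {}
    earlier = []
    for level in meta['ordering']:
        for name in level:
            # Copy, so later extensions of `earlier` do not leak into this entry.
            predecessors[name] = list(earlier)
        earlier.extend(level)
    return predecessors


meta = json.loads('{"ordering": [["a", "b"], ["c", "d"], ["e"]]}')
for name, before in ordering_to_predecessors(meta).items():
    print(name, before)
# a []
# b []
# c ['a', 'b']
# d ['a', 'b']
# e ['a', 'b', 'c', 'd']
```

With the complete ordering shown earlier, each variable would simply list every variable from the preceding sub-lists.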


## Learning

Learning a Bayesian Belief Network (BBN) means learning its structure and parameters. The structure is typically learned first, and the parameters afterwards. The signature of the learn_structure(...) function is as follows.

    learn_structure(df_path: str, meta_path: str, n_way=3,
                    ignore_neg_gt=-0.1, ignore_pos_lt=0.1,
                    n_regressions=10, solver='liblinear', penalty='l1', C=0.2,
                    robust_threshold=0.9) -> Dict[str, List[str]]


Since structure learning uses logistic regression with LASSO (L1) regularization, you need to specify how the regression is performed. The solver can be either liblinear or saga. The penalty must be l1. The regularization parameter C is a positive number, typically in (0, 1]; a smaller value means stronger regularization. See scikit-learn’s official documentation for LogisticRegression for additional information.
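To see why a smaller C means stronger regularization, it helps to write out the objective. The form below is the standard L1-penalized logistic regression objective as commonly documented for scikit-learn, not something specific to lassobbn. The symbols $w$ (coefficients), $b$ (intercept), $x_i$ and $y_i$ are introduced here for illustration, with labels coded as $y_i \in \{-1, +1\}$.

$$\min_{w, b} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + b\right)\right)\right)$$

Because C multiplies the data-fit term rather than the penalty, lowering C makes the $\lVert w \rVert_1$ term relatively more important, which drives more coefficients to exactly zero.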

What is returned is a Python dictionary that stores the child to parent relationships. Here is an example of the dictionary that is returned.

    {
      "e": ["d!b"],
      "d": ["b!a"]
    }
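Names such as `d!b` appear to denote a compound parent built from `d` and `b`: in the Quickstart output, the learned graph contains `d -> d!b`, `b -> d!b` and `d!b -> e`. The sketch below reproduces that edge set from the dictionary. The function name `parents_to_edges` and the rule of splitting on `!` are assumptions made for illustration, not the library's documented behavior.

```python
def parents_to_edges(parents, sep='!'):
    """Expand a child -> parents mapping into a set of (parent, child) edges."""
    edges = set()
    for child, pas in parents.items():
        for pa in pas:
            edges.add((pa, child))
            if sep in pa:
                # Assumption: a compound name lists its component variables.
                for component in pa.split(sep):
                    edges.add((component, pa))
    return edges


parents = {'e': ['d!b'], 'd': ['b!a']}
for pa, ch in sorted(parents_to_edges(parents)):
    print(f'{pa} -> {ch}')
```

The six edges produced are the same six listed under "structure" in the Quickstart, though possibly in a different order.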


After you have learned the child to parent relationships (or equivalently, parent to child relationships), you should then learn the parameters. The signature of the learn_parameters(...) function is as follows.

    learn_parameters(df_path: str, pas: Dict[str, List[str]]) -> \
        Tuple[Dict[str, List[str]], nx.DiGraph, Dict[str, List[float]]]


The output of learn_parameters(...) is a tuple of three items.

• domains of each variable

• graphical structure

• conditional probability tables for each variable
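As an illustration of what a conditional probability table holds, the sketch below estimates one by counting. It assumes a flattened layout in which the probabilities for each parent configuration are listed consecutively, which is consistent with lists such as `b: [0.80, 0.20]` in the Quickstart output. The function name `estimate_cpt` and the tiny data set are hypothetical, and this is not presented as how learn_parameters(...) is implemented.

```python
from collections import Counter
from itertools import product


def estimate_cpt(rows, child, parents, domain=('0', '1')):
    """Estimate P(child | parents) by counting; values are grouped by parent configuration."""
    counts = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    table = []
    for config in product(domain, repeat=len(parents)):
        total = sum(counts[(config, v)] for v in domain)
        for v in domain:
            # Fall back to a uniform distribution for unseen parent configurations.
            table.append(counts[(config, v)] / total if total else 1 / len(domain))
    return table


# Made-up records for two binary variables, d (parent) and e (child).
rows = [
    {'d': '0', 'e': '0'}, {'d': '0', 'e': '0'}, {'d': '0', 'e': '0'}, {'d': '0', 'e': '1'},
    {'d': '1', 'e': '1'}, {'d': '1', 'e': '1'},
]
print(estimate_cpt(rows, 'e', ['d']))
# [0.75, 0.25, 0.0, 1.0]
```

With an empty parent list, the same function returns the marginal distribution of the variable.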

TODO

• L00: Implement LASSO regression with continuous dependent variable.

• L01: Implement LASSO regression with categorical independent variable.

• L02: How do we implement LASSO regression with categorical dependent variable?

• L03: How do we learn with partial ordering of the variables? (DONE)

• L04: How do we learn with no ordering of the variables?

• L05: Implement blacklisted or whitelisted edges.

## Inference

After you have learned the structure and parameters of the BBN, you can use Py-BBN to perform inference. First, create an instance of a BBN, and then use that BBN instance to create an instance of a Join Tree (JT). Py-BBN is open source and may be installed from PyPI. This library already lists Py-BBN as a requirement, so installing this library also installs Py-BBN. The functions to pay attention to are as follows.

• to_bbn(d, g, p) : uses the domain information d, structure g and parameters p to create a Bayesian Belief Network (BBN)

• to_join_tree(bbn) : converts a BBN to a Join Tree (JT)

• posteriors_to_df(jt) : gets the posterior information as a data frame

    # Step 3. Get the BBN
    bbn = to_bbn(d, g, p)

    # Step 4. Get the Join Tree
    jt = to_join_tree(bbn)

    print('bbn')
    print(bbn)
    print('-' * 15)
    # 0|d!b|0,1
    # 1|e|0,1
    # 2|d|0,1
    # 3|b|0,1
    # 4|b!a|0,1
    # 5|a|0,1
    # 0->1
    # 2->0
    # 3->0
    # 3->4
    # 4->2
    # 5->4

    print('join tree')
    print(jt)
    print('-' * 15)
    # (d!b,e)
    # (b,d,d!b)
    # (b,b!a,d)
    # (a,b,b!a)
    # |(b,d,d!b) -- d,b -- (b,b!a,d)|
    # |(b,b!a,d) -- b,b!a -- (a,b,b!a)|
    # |(d!b,e) -- d!b -- (b,d,d!b)|
    # (b,d,d!b)--|(b,d,d!b) -- d,b -- (b,b!a,d)|--(b,b!a,d)
    # (b,b!a,d)--|(b,b!a,d) -- b,b!a -- (a,b,b!a)|--(a,b,b!a)
    # (d!b,e)--|(d!b,e) -- d!b -- (b,d,d!b)|--(b,d,d!b)

    # Get posteriors
    print('posteriors')
    mdf = posteriors_to_df(jt)
    print(mdf)

    # should print
    #              0         1
    # name
    # d!b   0.960997  0.039003
    # e     0.740779  0.259221
    # d     0.795200  0.204800
    # b     0.802900  0.197100
    # b!a   0.840211  0.159789
    # a     0.189300  0.810700


## Algorithm

Structure learning of causal Bayesian Belief Networks (BBNs) using regression and sequence information has been reported [Ale20a, Ale20b]. In this section, we take a less formal approach to explaining the structure learning algorithm.

The algorithm is easiest to understand when a complete ordering of the variables is given. Assume we have a causal BBN for which we know the true structure and parameters, shown in the figure below, with all variables binary. Now suppose we have observed data from this causal BBN, and a sample of the data looks as follows.

    a,b,c,d,e
    1,0,0,0,0
    1,0,0,0,0
    1,0,0,1,1
    0,0,0,0,1
    0,0,0,0,0
    1,0,0,0,1
    1,0,0,0,0
    1,0,0,1,1
    0,0,0,0,1
    1,0,0,0,0


If a user can correctly specify the order of the variables by indicating which variables occur before which others, then we can induce (learn) a causal BBN structure from the data. Let’s say a user specifies the order to be a, b, c, d, e. Note that even though a does not come before b, nor b before a, that is okay: they are tied, and we just need an ordering.

The structure learning algorithm iterates over each variable as a dependent variable while regressing on all those that come before it. Since there are 5 variables, there are a maximum of 5 regression equations to run. Since a is the first variable and no other variables precede it, we will only run 4 regression equations.

• $$e = a + b + c + d$$

• $$d = a + b + c$$

• $$c = a + b$$

• $$b = a$$

We identify which independent variables are not parents of the dependent variable by using two facts: the sequence implies time dependency, and the coefficient associated with each independent variable indicates prediction strength. LASSO regularization forces the coefficients of weak predictors to zero, so each model is expected to have non-zero coefficients only for those independent variables that are parents of the dependent variable. The following table lists, for each dependent variable (row), the coefficient of each variable in its model.

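Regression Parameters

| Dependent Variable | a | b | c | d | e |
|---|---|---|---|---|---|
| a | 0 | 0 | 0 | 0 | 0 |
| b | 0 | 0 | 0 | 0 | 0 |
| c | 0.8 | 0.3 | 0 | 0 | |
| d | 0 | 0 | 0 | 0 | 0 |
| e | 0 | 0 | 0.3 | 0.8 | 0 |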

The regression models for a, b and d have only zero coefficients, so these variables have no parents. The regression model for c as the dependent variable suggests that a and b are its parents. The regression model for e as the dependent variable suggests that c and d are its parents.

With the sequence to guide model building and LASSO regularization to shrink coefficients, we can now induce parent-child relationships between each dependent variable and the variables with non-zero coefficients (non-zero in absolute value). We proceed through the models from e to a (as the dependent variable), drawing the arcs from parent to child one at a time; where an arc would form a cycle, we skip it.
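The arc-drawing step described above can be sketched in a few lines. The code walks the dependent variables from last to first, adds a parent -> child arc for each non-zero coefficient, and skips any arc that would close a cycle. The function names `has_path` and `draw_arcs` are hypothetical, and the coefficients are the non-zero entries from the table in this section; this is an illustration of the procedure, not the library's implementation.

```python
def has_path(children, start, goal):
    """Depth-first search: is goal reachable from start along existing arcs?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return False


def draw_arcs(coefficients, order):
    """Add parent -> child arcs for non-zero coefficients, skipping cycle-forming ones."""
    children, arcs = {}, []
    for child in reversed(order):  # proceed from the last variable to the first
        for parent, coef in coefficients.get(child, {}).items():
            if coef == 0 or has_path(children, child, parent):
                continue  # zero coefficient, or parent -> child would close a cycle
            children.setdefault(parent, []).append(child)
            arcs.append((parent, child))
    return arcs


# Dependent variable -> {independent variable: coefficient}.
coefficients = {'c': {'a': 0.8, 'b': 0.3}, 'e': {'c': 0.3, 'd': 0.8}}
print(draw_arcs(coefficients, ['a', 'b', 'c', 'd', 'e']))
# [('c', 'e'), ('d', 'e'), ('a', 'c'), ('b', 'c')]
```

With a valid ordering, each regression only uses earlier variables, so cycles should not arise; the check matters when the input coefficients do not respect a single ordering.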

## API Documentation

### Learn

lassobbn.learn.do_learn(df_path: str, nodes: List[str], seen: Dict[str, List[str]], ordering_map: Dict[str, List[str]], n_way=3, ignore_neg_gt=-0.1, ignore_pos_lt=0.1, n_regressions=10, solver='liblinear', penalty='l1', C=0.2, robust_threshold=0.9) -> None

Recursively learns parents or robust independent variables associated with each variable.

Parameters
• df_path – CSV path.

• nodes – List of variables.

• seen – Dictionary storing processed/seen variables.

• ordering_map – Ordering map.

• n_way – Number of n-way interactions. Default is 3.

• ignore_neg_gt – Threshold for ignoring negative coefficients.

• ignore_pos_lt – Threshold for ignoring positive coefficients.

• n_regressions – The number of regressions to do. Default is 10.

• solver – Solver. Default is liblinear.

• penalty – Penalty. Default is l1.

• C – Regularization strength. Default is 0.2.

• robust_threshold – Robustness threshold. Default is 0.9.

Returns

None.

lassobbn.learn.do_regression(X_cols: List[str], y_col: str, df: pandas.core.frame.DataFrame, solver='liblinear', penalty='l1', C=0.2) -> sklearn.linear_model._logistic.LogisticRegression

Performs regression.

Parameters
• X_cols – Independent variables.

• y_col – Dependent variable.

• df – Data frame.

• solver – Solver. Default is liblinear.

• penalty – Penalty. Default is l1.

• C – Strength of regularization. Default is 0.2.

Returns

Logistic regression model.

lassobbn.learn.do_robust_regression(X_cols: List[str], y_col: str, df_path: str, n_way=3, ignore_neg_gt=-0.1, ignore_pos_lt=0.1, n_regressions=10, solver='liblinear', penalty='l1', C=0.2, robust_threshold=0.9) -> Dict[str, Union[str, List]]

Performs robust regression.

Parameters
• X_cols – List of independent variables.

• y_col – Dependent variable.

• df_path – Path of CSV file.

• n_way – Number of n-way interactions. Default is 3.

• ignore_neg_gt – Threshold for ignoring negative coefficients.

• ignore_pos_lt – Threshold for ignoring positive coefficients.

• n_regressions – The number of regressions to do. Default is 10.

• solver – Solver. Default is liblinear.

• penalty – Penalty. Default is l1.

• C – Regularization strength. Default is 0.2.

• robust_threshold – Robustness threshold. Default is 0.9.

Returns

A dictionary storing parents of a child. The parents are said to be robust.

lassobbn.learn.expand_data(df_path: str, parents: Dict[str, List[str]]) -> pandas.core.frame.DataFrame

Expands data with additional columns defined by parent-child relationships.

Parameters
• df_path – CSV path.

• parents – Parent-child relationships.

Returns

Data frame.

lassobbn.learn.extract_meta(meta_path: str) -> Tuple[Dict[str, List[str]], List[str]]

Extracts meta data.

Parameters

meta_path – Metadata path (JSON file).

Returns

Tuple; (ordering map, start nodes).

lassobbn.learn.extract_model_params(independent_cols: List[str], y_col: str, model: sklearn.linear_model._logistic.LogisticRegression) -> Dict[str, Union[str, float]]

Extracts parameters from models (e.g. coefficients).

Parameters
• independent_cols – List of independent variables.

• y_col – Dependent variable.

• model – Logistic regression model.

Returns

Parameters (e.g. coefficients of each independent variable).

lassobbn.learn.get_data(df_path: str, X_cols: List[str], y_col: str, n_way=3) -> pandas.core.frame.DataFrame

Gets a data frame with additional columns representing the n-way interactions.

Parameters
• df_path – Path to CSV file.

• X_cols – List of variables.

• y_col – The dependent variable.

• n_way – Number of n-way interactions. Default is 3.

Returns

Data frame.

lassobbn.learn.get_graph(parents: Dict[str, List[str]]) -> networkx.classes.digraph.DiGraph

Gets a graph nx.DiGraph.

Parameters

parents – Dictionary; keys are children, values are list of parents.

Returns

Graph.

lassobbn.learn.get_n_way(X_cols: List[str], n_way=3) -> List[Tuple[str, ...]]

Gets up to all n-way interactions.

Parameters
• X_cols – List of variables.

• n_way – Maximum n-way interactions. Default is 3.

Returns

List of n-way interactions.
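As a rough illustration of what "up to all n-way interactions" can mean, the sketch below enumerates every combination of 1 up to `n_way` variables with `itertools.combinations`. The function name `n_way_interactions` is hypothetical, and whether the library's result includes the single-variable tuples is an assumption.

```python
from itertools import chain, combinations


def n_way_interactions(cols, n_way=3):
    """Return every combination of 1 up to n_way columns, smallest first."""
    sizes = range(1, min(n_way, len(cols)) + 1)
    return list(chain.from_iterable(combinations(cols, k) for k in sizes))


print(n_way_interactions(['a', 'b', 'c'], n_way=2))
# [('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c'), ('b', 'c')]
```

The number of combinations grows quickly with the number of variables and with `n_way`, which is one reason to keep `n_way` small.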

lassobbn.learn.get_ordering_map(meta: Dict[str, any]) -> Dict[str, List[str]]

Gets a dictionary specifying ordering. A key is a variable; a value is a list of variables that come before it.

Parameters

meta – Meta information (dictionary).

Returns

Ordering.

lassobbn.learn.get_robust_stats(robust: pandas.core.frame.DataFrame, robust_threshold=0.9) -> pandas.core.frame.DataFrame

Computes the robustness statistics.

Parameters
• robust – Data frame of robustness indicators.

• robust_threshold – Threshold for robustness. Default is 0.9.

Returns

Data frame of variables that are robust.

lassobbn.learn.get_start_nodes(meta: Dict[str, any]) -> List[str]

Gets a list of start variables/nodes to kick off the algorithm.

Parameters

meta – Meta information (dictionary).

Returns

Start nodes.

lassobbn.learn.learn_parameters(df_path: str, pas: Dict[str, List[str]]) -> Tuple[Dict[str, List[str]], networkx.classes.digraph.DiGraph, Dict[str, List[float]]]

Gets the parameters.

Parameters
• df_path – CSV file.

• pas – Parent-child relationships (structure).

Returns

Tuple; first item is dictionary of domains; second item is a graph; third item is dictionary of probabilities.

lassobbn.learn.learn_structure(df_path: str, meta_path: str, n_way=3, ignore_neg_gt=-0.1, ignore_pos_lt=0.1, n_regressions=10, solver='liblinear', penalty='l1', C=0.2, robust_threshold=0.9) -> Dict[str, List[str]]

Kicks off the learning process.

Parameters
• df_path – CSV path.

• meta_path – Metadata path (JSON file).

• n_way – Number of n-way interactions. Default is 3.

• ignore_neg_gt – Threshold for ignoring negative coefficients.

• ignore_pos_lt – Threshold for ignoring positive coefficients.

• n_regressions – The number of regressions to do. Default is 10.

• solver – Solver. Default is liblinear.

• penalty – Penalty. Default is l1.

• C – Regularization strength. Default is 0.2.

• robust_threshold – Robustness threshold. Default is 0.9.

Returns

Dictionary where keys are children and values are list of parents.

lassobbn.learn.posteriors_to_df(jt: pybbn.graph.jointree.JoinTree) -> pandas.core.frame.DataFrame

Converts posteriors to data frame.

Parameters

jt – Join tree.

Returns

Data frame.

lassobbn.learn.to_bbn(d: Dict[str, List[str]], s: networkx.classes.digraph.DiGraph, p: Dict[str, List[float]]) -> pybbn.graph.dag.Bbn

Converts the structure and parameters to a BBN.

Parameters
• d – Domain of each variable.

• s – Structure.

• p – Parameter.

Returns

BBN.

lassobbn.learn.to_join_tree(bbn: pybbn.graph.dag.Bbn) -> pybbn.graph.jointree.JoinTree

Converts a BBN to a Join Tree.

Parameters

bbn – BBN.

Returns

Join Tree.

lassobbn.learn.to_robustness_indication(params: pandas.core.frame.DataFrame, ignore_neg_gt=-0.1, ignore_pos_lt=0.1) -> pandas.core.frame.DataFrame

Checks if each coefficient value is “robust”. A coefficient is NOT robust if it lies between ignore_neg_gt and ignore_pos_lt, that is, if it is too close to zero.

Parameters
• params – Data frame of parameters.

• ignore_neg_gt – Threshold. Default is -0.1.

• ignore_pos_lt – Threshold. Default is 0.1.

Returns

Data frame (all 1’s and 0’s) indicating robustness.
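One reading of the two thresholds, suggested by their names, is that negative coefficients greater than ignore_neg_gt and positive coefficients less than ignore_pos_lt are ignored. The sketch below encodes that reading and then computes the fraction of repeated regressions in which a coefficient was robust. The function names, the made-up coefficient values, and the idea that this fraction is what gets compared against robust_threshold are all assumptions for illustration.

```python
def is_robust(coef, ignore_neg_gt=-0.1, ignore_pos_lt=0.1):
    """Return 1 if coef lies outside the interval [ignore_neg_gt, ignore_pos_lt], else 0."""
    return int(coef < ignore_neg_gt or coef > ignore_pos_lt)


def robust_fraction(coefs, **thresholds):
    """Fraction of repeated regressions in which the coefficient was robust."""
    return sum(is_robust(c, **thresholds) for c in coefs) / len(coefs)


# Made-up coefficients of one variable across 10 regressions.
coefs = [0.42, 0.38, 0.05, 0.51, 0.47, 0.40, 0.44, 0.39, 0.46, 0.43]
print(robust_fraction(coefs))
# 0.9
```

Under this reading, a variable whose fraction reaches robust_threshold (0.9 by default) would be kept as a parent.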

lassobbn.learn.trim_parents(parents: List[str]) -> List[str]

Prunes or trims down the list of parents. There might be duplicates as a result of compound or n-way interactions.

Parameters

parents – List of parents.

Returns

List of (pruned/trimmed) parents.

lassobbn.learn.trim_relationships(rels: Dict[str, List[str]]) -> Dict[str, List[str]]

Trims/prune parent-child relationships.

Parameters

rels – Dictionary of parent-child relationships.

Returns

Dictionary of trimmed parent-child relationships.

## Other APIs

If you like lassobbn, you might be interested in other products.

### Py-BBN

pybbn is an open-source Bayesian Belief Network project for causal and exact inference!

### Turing BBN

turing_bbn is a C++17 implementation of py-bbn; take your causal and probabilistic inferences to the next computing level!

### PySpark BBN

pyspark-bbn is a scalable, massively parallel processing (MPP) framework for learning structures and parameters of Bayesian Belief Networks (BBNs) using Apache Spark.

## Bibliography

[Ale20a] F. Alemi. Constructing causal networks through regressions: a tutorial. Quality Management in Health Care, 29(2):270–278, 2020.

[Ale20b] F. Alemi. Worry less about the algorithm, more about the sequence of events. Mathematical Biosciences and Engineering, 17(6):6557–6572, 2020.

# Citation

    @misc{alemi_2021,
      title={lasso-bbn},
      author={F. Alemi and J. Vang},
      year={2021},
      month={Aug}
    }


# Authors

## Jee Vang, Ph.D.

• Patreon: support is appreciated

• GitHub: sponsorship will help us change the world for the better

# Acknowledgement

This software was funded by the Department of Health Administration and Policy (HAP), under the College of Health and Human Services (CHHS) at George Mason University (GMU). The contents are those of the author(s) and do not necessarily represent the official views of, nor an endorsement by, HAP, CHHS, or GMU.