Nodes and Edges
This section explains the details of nodes and edges, their different types, and other information that can be useful when simulating data using PARCS.
Node Types
Stochastic Nodes
In the previous section, we saw how a node is described by its sampling distribution. This type of node is called a stochastic node in PARCS. For a distribution with multiple parameters, the general description of these nodes follows this pattern:
N: outputDistribution(param1_=..., param2_=..., ...)
As explained in the theoretical background, parameters accept bias, linear, interaction, and quadratic terms of the parents. For example, if a node has three parents A, B, and C, the possible terms are:
A, B, C, A^2, AB, AC, B^2, BC, C^2
and hence, a possible combination looks something like A^2 + 2A + BC - 1.
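As a minimal sketch (node names and the exact combination of terms are illustrative, but they follow the term syntax above), such an expression can appear directly as a parameter of a node description:
from pyparcs import Description, Graph

# Illustrative only: C's mean combines bias (-1), linear (2A),
# interaction (AB), and quadratic (A^2) terms of its parents
description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'normal(mu_=0, sigma_=1)',
                           'C': 'normal(mu_=A^2+2A+AB-1, sigma_=1)',
                           'A->C': 'identity()', 'B->C': 'identity()'})
samples, _ = Graph(description).sample(5)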
Stochastic nodes can be marked with a feature called Correction. To explain this feature, look at the graph below:
from pyparcs import Description, Graph

description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'bernoulli(p_=0.3)'})
graph = Graph(description)
samples, _ = graph.sample(5)
print(samples)
#           A    B
# 0 -1.527236  0.0
# 1  0.385980  1.0
# 2 -0.217019  0.0
# 3 -0.137873  0.0
# 4  0.462656  1.0
This is an ordinary simulation graph. Now, what happens if we describe B as a child of A?
from pyparcs import Description, Graph

description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'bernoulli(p_=2A)',
                           'A->B': 'identity()'})
graph = Graph(description)
samples, _ = graph.sample(5)
print(samples)
# pyparcs.core.exceptions.DistributionError: Bern(p) probabilities are out of [0, 1] range
Of course, the range of 2A is bigger than [0, 1]; therefore, the Bernoulli distribution receives invalid success probabilities for some of the samples. For distribution parameters whose valid range is smaller than the real line, PARCS provides a simple sigmoid transformation that maps the real line to a custom range. For the Bernoulli parameter, a standard sigmoid transformation does the job, as it maps the values to [0, 1]. For other parameters, such as the normal's standard deviation or the exponential's rate, we can obtain wider ranges by customizing the sigmoid function. This feature is activated by putting a correction[] after the distribution:
description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'bernoulli(p_=2A), correction[]',
                           'A->B': 'identity()'})
graph = Graph(description)
samples, _ = graph.sample(5)
print(samples)
#           A    B
# 0 -0.135696  0.0
# 1  0.085163  0.0
# 2  0.428591  1.0
# 3 -0.586112  0.0
# 4  1.222686  0.0
The correction function has three parameters: lower, upper, and target_mean. By default, lower and upper are 0 and 1. If we want to impose a certain mean on the corrected parameter, we can also set the target_mean parameter. This parameter is useful, for example, when we want a simulated Bernoulli node to have a certain expected value.
description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'bernoulli(p_=2A), correction[lower=0, upper=1, target_mean=0.3]',
                           'A->B': 'identity()'})
graph = Graph(description)
samples, _ = graph.sample(1000)

print(samples['B'].mean())
# 0.319
Note that when using correction, the argument given to the corrected parameter is wrapped in a sigmoid function, and the target mean is achieved by updating the bias value. It is therefore best to read the correction formula carefully before including it in your simulations. Overall, this feature can be useful, as researchers often use this trick to sample from distributions with restricted parameters.
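As a conceptual sketch of the transformation (not the exact PARCS implementation; in particular, the bias update for target_mean is omitted here), the correction maps an unrestricted value into [lower, upper] via a scaled logistic function:
import numpy as np

def corrected_param(raw, lower=0.0, upper=1.0):
    """Map an unrestricted raw value into [lower, upper] with a sigmoid."""
    return lower + (upper - lower) / (1.0 + np.exp(-raw))

# e.g. raw = 2A for the Bernoulli success probability
corrected_param(np.array([-3.0, 0.0, 3.0]))
# array([0.04742587, 0.5       , 0.95257413])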
Deterministic Nodes
How do we describe a simulation variable such as \(Y = a_0 + a_1X_1 + a_2\sin{(X_2)} - a_3\log(X_3^3)\)? Clearly, we cannot use stochastic nodes: apart from the fact that only bias, linear, interaction, and quadratic terms are allowed, the described variable is not a random variable but a deterministic one. You can use deterministic(_, _) nodes to implement such variables in the graph. The first argument points to a Python script consisting of custom Python functions, and the second argument specifies the name of the function:
from pyparcs import Description, Graph
import numpy as np
np.random.seed(42)

description = Description({'A': 'bernoulli(p_=0.3)',
                           'B': 'normal(mu_=0, sigma_=1)',
                           'C': 'deterministic(./custom_functions.py, log_sin)',
                           'A->C': 'identity()', 'B->C': 'identity()'})
graph = Graph(description)
samples, errors = graph.sample(2)
print(samples)
#      A         B         C
# 0  0.0  1.651819  2.204312
# 1  0.0 -1.010956  3.559511

print(errors)
#           A         B         C
# 0  0.374540  0.950714  0.731994
# 1  0.598658  0.156019  0.155995
and the custom_functions.py file is:
import numpy as np


def log_sin(data):
    """some custom functions"""
    return 1 + 2 * np.log(data['A'] + 1) + 3 * np.sin(data['B']**2)
Some remarks:
1. The parents of a deterministic node are specified only by the edges in the outline and by the columns that the custom function reads from the data object.
2. Because of (1), the infer_edges option does not work for deterministic nodes.
3. The data object that the function receives is a pandas DataFrame. Note that the edge functions are already applied to the data before it is passed to the node.
4. Even though it is not needed, the error vector is sampled for the deterministic node too. Apart from consistency, this will help us later if we want to compare two graphs over a unique data point when a deterministic node is replaced by a stochastic one.
Warning
PARCS identifies the custom function by appending the given file path to the Python path. Therefore, be careful not to have a name conflict with other functions in your Python path. A good precautionary measure is to always use a unique prefix for the functions, such as parcs_custom_...
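For example (an illustrative rename of the earlier file, following this advice):
import numpy as np

def parcs_custom_log_sin(data):
    """Same function as before, prefixed to avoid clashes on the Python path."""
    return 1 + 2 * np.log(data['A'] + 1) + 3 * np.sin(data['B']**2)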
What if we wanted to create a stochastic node (e.g., a normally distributed variable) but with a custom equation for its parameters? Currently, this is not immediately possible; but as a quick workaround, you can first create a dummy deterministic node and pass it to the stochastic node:
from pyparcs import Description, Graph
import numpy as np
np.random.seed(42)

description = Description({'A': 'bernoulli(p_=0.3)',
                           'B': 'normal(mu_=0, sigma_=1)',
                           'C_dummy': 'deterministic(./custom_functions.py, log_sin)',
                           'C': 'normal(mu_=C_dummy, sigma_=1)',
                           'A->C_dummy': 'identity()', 'B->C_dummy': 'identity()',
                           'C_dummy->C': 'identity()'})
graph = Graph(description)
samples, _ = graph.sample(2)
print(samples)
#      A         B   C_dummy         C
# 0  0.0  1.651819  2.204312  2.823166
# 1  0.0 -1.011057  3.559830  1.988763
Later, we introduce a small feature to better represent dummy nodes and filter them out of the final samples data frame.
Data Nodes
In many simulation studies, it is desirable to augment a real dataset with some simulated nodes; an example is simulating a missing data scenario, where a missingness mask is simulated for a given dataset. Data nodes allow us to read an external CSV file to sample a node. The arguments of a data node are the CSV file path and the name of the variable in the CSV file. Consider the following dataset.csv:
,A,B
0,10,100
1,11,101
2,12,102
from pyparcs import Description, Graph
import numpy as np
np.random.seed(42)

description = Description({'A': 'data(./dataset.csv, A)',
                           'B': 'data(./dataset.csv, B)',
                           'C': 'normal(mu_=A+B, sigma_=1)',
                           'A->C': 'identity()',
                           'B->C': 'identity()'})
graph = Graph(description)
samples, errors = graph.sample(5)

print(samples)
#     A    B           C
# 0  11  101  112.618855
# 1  11  101  110.988943
# 2  10  100  110.256234
# 3  12  102  115.879470
# 4  12  102  113.091568

print(errors)
#           A         B         C
# 0  0.374540  0.374540  0.731994
# 1  0.598658  0.598658  0.155995
# 2  0.058084  0.058084  0.601115
# 3  0.708073  0.708073  0.969910
# 4  0.832443  0.832443  0.181825
Some remarks:
The error vectors for A and B are identical. This is because we are sampling from the joint \((A, B)\) distribution. In other words, whole rows of the CSV file are being sampled, rather than individual elements.
The length of the dataset is 3, but in the example, we sampled 5 data points. Because of that, we have duplicate samples for \((A, B)\). The interpretation of this behavior is that we are sampling from a distribution which is determined by the dataset. Just like a parametric distribution such as Bernoulli, realizations may repeat.
We can sample the entire dataset once, using full_data=True in the sample method. This way, the size of the sample is determined by the length of the dataset, and every data point is sampled exactly once:
samples, errors = graph.sample(full_data=True)
print(samples)
#     A    B           C
# 0  10  100  110.618855
# 1  11  101  110.988943
# 2  12  102  114.256234

print(errors)
#      A    B         C
# 0  0.0  0.0  0.731994
# 1  0.5  0.5  0.155995
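Conceptually (a sketch of the row-sampling behavior, not PARCS internals; the file name follows the example above), a data node resamples whole rows of the CSV, which is why the parent columns share one error value per sampled row:
import numpy as np
import pandas as pd

df = pd.read_csv('./dataset.csv', index_col=0)
errors = np.random.uniform(size=5)                 # one error value per sampled row
rows = (errors * len(df)).astype(int)              # map each error to a row index
resampled = df.iloc[rows].reset_index(drop=True)   # A and B stay jointly sampled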
Note
Data nodes must have no incoming edges in the outline.
Constant Nodes
This class of nodes provides a simple way of adding a constant-value node to the graph. The syntax in the description file is as simple as creating a node by constant(<value>):
description = Description({'A': 'constant(2)',
                           'B': 'constant(3)'})
graph = Graph(description)
samples, _ = graph.sample(3)
print(samples)
#      A    B
# 0  2.0  3.0
# 1  2.0  3.0
# 2  2.0  3.0
Note
We can mimic deterministic and constant nodes by hacking a stochastic normal node, with mu_ being the desired constant or deterministic term and sigma_ being 0. Using the deterministic and constant nodes, however, adds to the clarity of the simulation and the description file.
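For illustration (a minimal sketch of this hack; the node names are hypothetical):
from pyparcs import Description

description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'CONST': 'normal(mu_=2, sigma_=0)',   # always 2
                           'DET': 'normal(mu_=2A+1, sigma_=0)',  # always 2A+1
                           'A->DET': 'identity()'})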
Edge Correction
Similar to the concept of distribution parameter correction for stochastic nodes, edges can also be marked for correction. For edges, this amounts to normalizing the input values using the mean and standard deviation of the parent node's distribution.
description = Description({'A': 'normal(mu_=100, sigma_=10)',
                           'B': 'normal(mu_=A, sigma_=1)',
                           'A->B': 'identity()'})
graph = Graph(description)
samples, errors = graph.sample(3)
#             A           B
# 0   96.801476   98.453296
# 1  106.188546  106.438423
# 2   89.890436   88.879378

description = Description({'A': 'normal(mu_=100, sigma_=10)',
                           'B': 'normal(mu_=A, sigma_=1)',
                           'A->B': 'identity(), correction[]'})
graph = Graph(description)
samples, _ = graph.sample(use_sampled_errors=True,
                          sampled_errors=errors)
#             A         B
# 0   96.801476  1.445580
# 1  106.188546  0.967103
# 2   89.890436 -1.897181
Edge correction for other edge functions has the same effect. The difference is that, after normalization, the input values go through the transformation defined by the edge function.
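As a conceptual sketch (not PARCS internals), edge correction standardizes the parent's samples using the parent distribution's moments and only then applies the edge function:
import numpy as np

def corrected_edge(parent_samples, parent_mean, parent_std,
                   edge_function=lambda z: z):
    """Standardize with the parent distribution's mean/std, then transform."""
    z = (parent_samples - parent_mean) / parent_std
    return edge_function(z)

# e.g. A ~ normal(100, 10) from the example above, with the identity edge
corrected_edge(np.array([96.801476, 106.188546, 89.890436]), 100, 10)
# array([-0.3198524,  0.6188546, -1.0109564])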