— Ann Nicholson (Workshop co-chair)
This single-day workshop is an excellent forum for presenting and hearing about real-world applications of Bayesian networks. It follows the 28th Int. Conference on Uncertainty in AI (Aug 15-17th), the premier conference for presentation of research on Bayesian technology. The call for papers is now out, with a submission deadline of May 5th (with a week's extension very likely!).
The aim of the workshop is to foster discussion and interchange about novel contributions that can speak to both the academic and the larger research community. Accordingly, we seek submissions from practitioners and tool developers as well as researchers. We welcome submissions describing real-world applications, whether as stand-alone BNs or where the BNs are embedded in a larger software system. We encourage authors to address the practical issues involved in developing real-world applications, such as knowledge engineering methodologies, elicitation techniques, defining and meeting client needs, validation processes and integration methods, as well as software tools, including visualization and user interaction techniques, to support these activities.
We particularly encourage the submission of papers that address the workshop theme of temporal modeling. Recently, communities building dynamic Bayes networks (DBNs) and partially observable MDPs (POMDPs) have come to realize that they are applying their methods to the same applications. Similarly, POMDPs and other probabilistic methods are now established in the field of Automated Planning. Stochastic process models such as continuous time Bayes networks (CTBNs) should also be considered as part of this trend. Adaptive and on-line learning models also fit into this focus.
—Lloyd Allison
Minimum message length (MML) inference is a computational implementation of Bayesian inference, an information-theoretic means of finding high posterior probability hypotheses, devised by Chris Wallace and David Boulton around 1968 (see Wallace's history of MML). MML seeks to minimise a two-part message length $I(h \wedge e) = I(h) + I(e \mid h)$, where $h$ encodes a hypothesis and $e$ some relevant evidence (data). So long as coding follows the principles developed by Claude Shannon, so that the codes satisfy the efficiency principle that message lengths are $I(x) = -\log P(x)$, then minimising the MML message length is trivially equivalent to maximising posterior probability:

$$\arg\min_h \, [I(h) + I(e \mid h)] = \arg\min_h \, [-\log P(h) - \log P(e \mid h)] = \arg\min_h \, [-\log (P(h)\,P(e \mid h))] = \arg\max_h \, P(h)\,P(e \mid h) = \arg\max_h \, P(h \mid e)$$

Since during this sequence we have multiplied by -1, we have also switched from minimising a message length to maximising a probability. And at the end, since $P(h)\,P(e \mid h)$ and $P(h \mid e)$ differ only by a positive multiple (see Bayes' theorem), maximising one is the same as maximising the other.
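The equivalence can be checked numerically on a toy problem. The following is a minimal sketch only: the three coin hypotheses, their priors, and the data string are made-up assumptions for illustration, not anything from Wallace and Boulton's work.

```python
import math

# Hypothetical example: three hypotheses about a coin's P(heads).
HYPOTHESES = {"fair": 0.5, "heads-biased": 0.8, "tails-biased": 0.2}
# Assumed prior probabilities over the hypotheses (they sum to 1).
PRIOR = {"fair": 0.6, "heads-biased": 0.2, "tails-biased": 0.2}


def likelihood(h, data):
    """P(e | h) for a string of independent tosses such as 'HHTH'."""
    p = HYPOTHESES[h]
    return math.prod(p if toss == "H" else 1.0 - p for toss in data)


def message_length(h, data):
    """Two-part length in bits: I(h) + I(e|h) = -log2 P(h) - log2 P(e|h)."""
    return -math.log2(PRIOR[h]) - math.log2(likelihood(h, data))


def posterior(data):
    """P(h | e) for every hypothesis, via Bayes' theorem."""
    joint = {h: PRIOR[h] * likelihood(h, data) for h in HYPOTHESES}
    evidence = sum(joint.values())
    return {h: j / evidence for h, j in joint.items()}


data = "HHHHTHHH"  # made-up observations: seven heads, one tail
post = posterior(data)
best_by_length = min(HYPOTHESES, key=lambda h: message_length(h, data))
best_by_posterior = max(post, key=post.get)
print(best_by_length, best_by_posterior)
```

Both selections pick the same hypothesis, because the message length is a monotonically decreasing function of the posterior.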
This foundation for minimum message length inference is quite elementary, so the fact that it was not in use before 1968 may be a little surprising. It is probably partly due to limits on computational capacity inhibiting Bayesian statistics and the related dominance of frequentist methods. That there remains any debate about computational Bayesianism, however, is even more surprising.
The application of Bayes' theorem is straightforward for discrete (multinomial) variables governed by a probability function. But consider a problem in which one or more variables are continuous, rather than discrete. Can Bayes' theorem apply?
- Any continuous attribute (variable) can only be measured to some limited accuracy.
- So, every datum that is possible under a model (theory, hypothesis) has a probability that is strictly greater than zero, and not just a probability density.
- Any continuous parameter of a model can only be inferred (estimated) to some limited precision.
- So, every parameter estimate that is possible under a prior has a probability that is strictly greater than zero, and not just a probability density.
- So, in continuous empirical domains both the data and the model spaces have a natural discretisation and Bayes' theorem can always be applied.
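A small sketch of the first two points, under stated assumptions: the model is a normal distribution, and a measurement `x` recorded to accuracy `eps` is taken to mean that the true value lies in an interval of width `eps` centred on `x`. The names `eps`, `mu`, and `sigma` and the numbers used are illustrative choices, not part of the original argument.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def normal_pdf(x, mu, sigma):
    """Probability density of a normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def prob_of_measurement(x, eps, mu, sigma):
    """Probability that the true value lies in [x - eps/2, x + eps/2]."""
    return normal_cdf(x + eps / 2, mu, sigma) - normal_cdf(x - eps / 2, mu, sigma)


x, eps, mu, sigma = 0.5, 0.01, 0.0, 1.0  # assumed values for illustration
p = prob_of_measurement(x, eps, mu, sigma)
approx = normal_pdf(x, mu, sigma) * eps   # density times interval width
bits = -math.log2(p)                      # message length of this datum
print(p, approx, bits)
```

The datum has a genuine, strictly positive probability, well approximated by the density times `eps`, so it also has a finite message length.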
However, this is not to say that it is easy to go and make MML work in any given application; in fact it can be quite difficult. After the self-evident observations above, a lot of hard work on efficient encodings, search algorithms, code books, invariance, Fisher information, fast approximations, robust heuristics, adaptations to specific problems, and all the rest, remained to be done. Fortunately, MML has been made to work in many general and useful applications.
For further information on MML you can peruse my MML web pages. For Chris Wallace's own account of MML see his book Inductive Inference by Minimum Message Length.
— Kevin B Korb
I am interviewed by Adam Ford on the topics of Bayesianism, Falsificationism, AI and the Singularity, etc.
Part I
Part II
Part III
—Ann E Nicholson
Since Bayesians without Borders will in significant part be about Bayesian networks and their uses, in this post I will introduce them to newcomers to the technology.
Bayesian networks (BNs) are an increasingly popular technology for representing and reasoning about problems in which probability plays a role. A Bayesian network is a directed, acyclic graph whose nodes represent random variables and whose arcs represent direct dependencies. The arcs often, but not always, also represent direct causal connections between the variables. The nodes pointing to a node $X$ are called its parents and collectively are denoted $\mathrm{Parents}(X)$. The relationship between variables is quantified by conditional probability tables (CPTs) associated with each node, namely $P(X \mid \mathrm{Parents}(X))$. The CPTs together compactly represent the full joint distribution. Users can set the values of any combination of nodes in the network that they have observed. This evidence, $E$, propagates through the network, producing a new posterior probability distribution for each variable in the network. There are a number of efficient exact and approximate inference algorithms for performing this probabilistic updating, providing a powerful combination of predictive, diagnostic and explanatory reasoning.
I will illustrate with the design of a BN for a simplified version of a real ecological problem, modeling native fish populations in Victoria. Problem: A local river with tree-lined banks is known to contain native fish populations, which need to be conserved. The river passes through croplands and is susceptible to drought conditions. Rainfall helps native fish populations by maintaining water flow, which increases habitat suitability as well as connectivity between different habitat areas. However, rain can also wash pesticides that are dangerous to fish from the croplands into the river. What we want to do is build a BN adequate for modeling this system.
The first step is to decide what the variables of interest are, which will become the nodes in the BN. The abundance of native fish directly depends only on the level of pesticide in the river and the river flow, hence Native Fish Abundance (a so-called "leaf node") has only those two parent nodes. River Flow depends on how much rain falls in a given year (Annual Rainfall), and how much of that water ends up in the river, which means it depends also on the long term Drought Conditions. The amount of pesticide in the river (Pesticide in River) depends on Pesticide Use and whether there is enough rain (Annual Rainfall) to wash it into the river. Finally, the condition of the trees on the river bank (Tree Condition) depends only on the long term drought and more recent rainfall.
This graphical structure captures these causal interactions:
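The same structure can also be written down as a map from each node to its parents. This is a minimal sketch rather than the format of any particular BN tool: the dictionary layout is an assumption, while the node names and arcs follow the description above. The standard-library `graphlib` module (Python 3.9+) is used only to confirm that the graph is acyclic.

```python
from graphlib import TopologicalSorter

# Each node maps to the list of its parents, as described in the text.
PARENTS = {
    "Pesticide Use": [],
    "Annual Rainfall": [],
    "Drought Conditions": [],
    "Pesticide in River": ["Pesticide Use", "Annual Rainfall"],
    "River Flow": ["Annual Rainfall", "Drought Conditions"],
    "Tree Condition": ["Drought Conditions", "Annual Rainfall"],
    "Native Fish Abundance": ["Pesticide in River", "River Flow"],
}

# static_order() raises graphlib.CycleError if there is a directed cycle,
# so getting an ordering back confirms the graph is a DAG.
order = list(TopologicalSorter(PARENTS).static_order())

# Root nodes have no parents; leaf nodes are nobody's parent.
roots = [node for node, parents in PARENTS.items() if not parents]
has_children = {p for parents in PARENTS.values() for p in parents}
leaves = [node for node in PARENTS if node not in has_children]

print("roots: ", roots)
print("leaves:", leaves)
```

Every parent appears before its children in `order`, which is also the order in which the nodes could be sampled or their CPTs multiplied to form the joint distribution.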
In consultation with an ecologist we might build the CPTs (i.e., elicit the parameters from the ecologist), as in these example tables:

Note that Pesticide Use (just "Pesticides" in the table) and Annual Rainfall are so-called "root nodes" with no parents, so there is only a single probability distribution for each, whereas for nodes with parents there is a conditional distribution for each possible instantiation of its parents.
The CPT for the Native Fish Abundance node shows the possible combinations of values for the parent nodes (Pesticides and River Flow), and a probability distribution of the resultant Native Fish Abundance over the three levels, High, Medium and Low. We can see that the best conditions for the fish are Low levels of pesticide and Good River Flow (0.80, 0.15, 0.05), while the worst are High levels of pesticide and Poor River Flow (0.01, 0.10, 0.89). Note also that there may well be other factors in play, such as the presence of exotic predators, or disease, that are not represented explicitly by nodes in the BN. The effects of these are averaged over in the CPTs. They are reflected, for example, in the 0.05 probability that native fish abundance is Low, even under the best pesticide and river flow conditions.
Now that we have the BN structure and its parameters, the BN can be used for reasoning. That is, we can instantiate different possible scenarios by setting the values of particular nodes and then updating the BN, using one of the many Bayesian network programs around, such as Netica. First, here is the network with no evidence:

Without making any observations, this BN tells us that the most likely state of the native fish is Low abundance (57.8%), though the Tree Condition is most likely Good (53.3%). If we add observations of the root nodes in the BN, when there is High pesticide use, above average rainfall and no drought conditions, we get:

This reasoning is predictive, from cause to effect. In this scenario, the prediction is that the Native Fish Abundance will improve, due to the River Flow being Good, despite the increased Pesticide in River. Alternatively, the BN can be used for diagnosis, by entering evidence for the Native Fish Abundance leaf node:

Compared with the no-evidence case, we can see that it is less likely that the pesticide use was high, less likely there have been drought conditions, and more likely that rainfall has been above average. Finally, we can use the BN in any arbitrary combination of diagnostic and predictive reasoning; here with evidence entered for both a cause (Pesticide Use being High) and an effect (Native Fish Abundance being High), resulting in (fairly small) changes to the distributions for all the other nodes:
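Predictive and diagnostic updating can be sketched by brute-force enumeration on a cut-down, three-node fragment of the network. This is an illustration under loud assumptions: the two parents are treated as root nodes with made-up priors, and only the (Low, Good) and (High, Poor) rows of the fish CPT come from the text; the other two rows are invented. The numbers it produces will therefore not match the Netica screenshots.

```python
from itertools import product

# Assumed priors: in the full BN these nodes have parents of their own.
P_PESTICIDE = {"High": 0.3, "Low": 0.7}
P_FLOW = {"Good": 0.6, "Poor": 0.4}

# P(fish | pesticide in river, river flow).
P_FISH = {
    ("Low", "Good"): {"High": 0.80, "Medium": 0.15, "Low": 0.05},   # from the text
    ("Low", "Poor"): {"High": 0.10, "Medium": 0.30, "Low": 0.60},   # assumed
    ("High", "Good"): {"High": 0.20, "Medium": 0.30, "Low": 0.50},  # assumed
    ("High", "Poor"): {"High": 0.01, "Medium": 0.10, "Low": 0.89},  # from the text
}


def joint(pesticide, flow, fish):
    """Joint probability as the product of each node's CPT entry."""
    return P_PESTICIDE[pesticide] * P_FLOW[flow] * P_FISH[(pesticide, flow)][fish]


def posterior(query, evidence):
    """P(query | evidence), summing the joint over all consistent worlds."""
    names = ("pesticide", "flow", "fish")
    domains = (P_PESTICIDE, P_FLOW, ("High", "Medium", "Low"))
    totals = {}
    for values in product(*domains):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # this world contradicts the evidence
        totals[world[query]] = totals.get(world[query], 0.0) + joint(**world)
    norm = sum(totals.values())
    return {state: total / norm for state, total in totals.items()}


# Predictive: from causes to effect.
predictive = posterior("fish", {"pesticide": "Low", "flow": "Good"})
# Diagnostic: from an observed effect back to a cause.
diagnostic = posterior("pesticide", {"fish": "High"})
print(predictive)
print(diagnostic)
```

With these assumed numbers, observing High fish abundance lowers the probability that pesticide levels were High, the same direction of change described above. Real BN software uses far more efficient algorithms than enumeration, which grows exponentially with the number of nodes.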

Here I have briefly described and illustrated the usual knowledge engineering process of building Bayesian networks. There is, of course, a great deal more to it when building a real network of any complexity, which you can read about in depth in our book Bayesian Artificial Intelligence. Some of these further topics, including causal discovery algorithms for learning BNs from sample data, will also be discussed in future posts in this blog.
Stanford University has been rolling out a variety of free classes in computer science over the Internet, including now one taught by Daphne Koller: see "Probabilistic Graphical Models".
The Association for Computing Machinery named Judea Pearl, one of the founders of Bayesian network technology, the 2011 Turing Award winner!!! Read the ACM press release here.
— Kevin B Korb†
Sally Clark, in an infamous miscarriage of justice, was convicted of murdering her two sons in the UK in 1999 after a prosecution which employed primarily statistical reasoning in a way that has become notorious as the "prosecutor's fallacy". Here I will briefly review the arguments and the statistical reasoning from a Bayesian perspective. I don't propose the details of this analysis (i.e., the exact probabilities) be taken too seriously. They are taken from fairly cursory searches on the Internet and applied in a fairly crude way. Regardless, they are far more serious than anything produced during the trial itself!
Sally Clark was arrested after her second baby died at a few months old, apparently of sudden infant death syndrome (SIDS), exactly as her first child had died a year earlier. According to prosecution testimony (by a pediatrician, Sir Roy Meadow), about 1 in 8543 babies die of SIDS. The prosecution argued that there is only a probability of $(1/8543)^2 \approx 1/73{,}000{,}000$ that two such deaths would happen in the same family by chance alone (after controlling for tobacco smoke and a few social factors). According to the prosecution, the woman was guilty beyond a reasonable doubt. The jury returned a guilty verdict, even though there was no substantial evidence of guilt presented beyond this argument.
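Let $h$ = Clark is guilty, $e_1$ = the evidence of the first son's death, $e_2$ = the evidence of the second son's death. Note that the latter two are meant to establish the appearance of SIDS deaths. Then the prosecutor's argument was:

$P(e_1 \mid \neg h) = P(e_2 \mid \neg h) = 1/8543$ (1)

- So,

$P(e_1, e_2 \mid \neg h) = (1/8543)^2 \approx 1/73{,}000{,}000$ (2)

- So,

$P(h \mid e_1, e_2) \approx 1$, i.e., guilt beyond a reasonable doubt (3)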
There are a lot of problems with this argument. Here I will discuss the two most basic errors, which probably have the most impact and which anyone involved with assessing evidence should be capable of recognizing. First, the combination of the evidence in (2), simply by multiplication, requires the two pieces of evidence to be independent of each other. The general form of such a combination is $P(e_1, e_2 \mid \neg h) = P(e_1 \mid \neg h)\,P(e_2 \mid e_1, \neg h)$, which reduces to (2) only if $P(e_2 \mid e_1, \neg h) = P(e_2 \mid \neg h)$, that is, only if the two items of evidence are independent given innocence. However, risk factors for SIDS are very likely to be common to multiple children within a family, including not just the tobacco smoke and the social factors controlled for, but also poor prenatal care, low birth weights, alcohol consumption and sleeping practices (and, to be sure, physical abuse by parents). In any case, one SIDS death is well known to raise the probability of another in the family;‡ therefore, the combined evidence of two deaths must have a higher probability than their simple multiplication. One study reported a relative risk of recurrence of SIDS of 5 times the background rate, a rate found to be comparable to other recurrent mortality risks in siblings. This yields $P(e_1, e_2 \mid \neg h) = (1/8543)(5/8543) \approx 1/14{,}600{,}000$, instead of $1/73{,}000{,}000$.
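The difference between the two ways of combining the evidence is easy to check with exact arithmetic. A minimal sketch, using only the 1 in 8543 rate and the relative risk of 5 quoted above; the variable names are illustrative.

```python
from fractions import Fraction

p_sids = Fraction(1, 8543)  # prosecution's figure for a single SIDS death
relative_risk = 5           # reported relative risk of recurrence

# Prosecution: treats the two deaths as independent given innocence.
p_independent = p_sids * p_sids

# Chain rule: P(e1, e2 | not h) = P(e1 | not h) * P(e2 | e1, not h),
# with the second factor raised by the relative risk.
p_dependent = p_sids * (relative_risk * p_sids)

print(f"independent: 1 in {float(1 / p_independent):,.0f}")
print(f"dependent:   1 in {float(1 / p_dependent):,.0f}")
```

Allowing for the dependence makes two SIDS deaths exactly five times more probable under innocence than the prosecution's figure.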
The second failure in the prosecution argument is the complete neglect of prior probabilities. Bayes' rule says (writing $e$ for the combined evidence $e_1, e_2$):

$$P(h \mid e) = \frac{P(e \mid h)\,P(h)}{P(e \mid h)\,P(h) + P(e \mid \neg h)\,P(\neg h)}$$

For simplicity, I will assume that $P(e \mid h) = 1$, i.e., that guilt would surely produce the evidence found. But, so far, the posterior probability of guilt can still be anything at all: we need the prior probability in order to nail down the posterior probability. The prosecutor's fallacy blithely assumes instead that $P(h \mid e) = P(e \mid h)$. This may arise because conditional probabilities are often read as "if-then" conditional statements, and these are tricky and easily misread as their reversals. (See, for example, Kahneman and Tversky's work on "base rate neglect".)
Rather than ignore the prior here, however, we should estimate it. The question is something like: how often do mothers murder their first two children within their first year of life? We can answer a more general question, namely how often do mothers kill one or more of their children, of any age. Using this, of course, means we will be overestimating the prior probability by some unknown, but likely large, amount, implying that we are only finding an upper bound to the probability of interest. A news report suggests there are about 100 cases a year in the United States, estimated from surveys of prison populations. Since there are about 120 million adult women in the United States, and about half of them have children, that yields 1 in 600,000 murdering their children in any given year. The homicide rate in the US is about 4 times higher than that in the UK (judging by this table), so that gives us 1 in 2.4 million. Of course, a mother may murder her children over the course of many years, but she cannot do so in a way that resembles SIDS beyond the child's first year. She might well get caught over the course of a few years, but using the annual figure alone almost certainly underestimates the probability of guilt by less than counting all cases of mothers killing their children overestimates it. This way of getting a prior probability is admittedly crude, but it is nevertheless far better than that used by the prosecution, namely ignoring the issue of the prior altogether! Using our assumptions above we have enough to work Bayes' theorem:

$$P(h \mid e) = \frac{P(e \mid h)\,P(h)}{P(e \mid h)\,P(h) + P(e \mid \neg h)\,P(\neg h)} = \frac{1 \times \frac{1}{2{,}400{,}000}}{1 \times \frac{1}{2{,}400{,}000} + \frac{5}{8543^2}\left(1 - \frac{1}{2{,}400{,}000}\right)} \approx 0.86$$
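The calculation can be reproduced in a few lines. This is a sketch using only the crude figures from the text (a prior of 1 in 2.4 million, the assumption that guilt would surely produce the evidence, and the relative-risk-adjusted probability of two SIDS deaths); the function and variable names are illustrative.

```python
from fractions import Fraction


def posterior_guilt(prior_h, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem for a binary hypothesis h versus not-h."""
    numerator = p_e_given_h * prior_h
    return numerator / (numerator + p_e_given_not_h * (1 - prior_h))


prior_h = Fraction(1, 2_400_000)          # crude prior estimated in the text
p_e_given_h = Fraction(1)                 # simplifying assumption in the text
p_e_given_not_h = Fraction(5, 8543 ** 2)  # two SIDS deaths, relative risk 5

p_guilt = posterior_guilt(prior_h, p_e_given_h, p_e_given_not_h)
print(f"P(guilt | evidence)     = {float(p_guilt):.3f}")
print(f"P(innocence | evidence) = {float(1 - p_guilt):.3f}")
```

Because the prior and the likelihood under innocence are both tiny and of similar size, the posterior is very sensitive to each of them, which is why neglecting either one is so damaging.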


This is a fairly high probability of guilt. However, if we were to routinely incarcerate people with a 14% chance of being innocent, we would be doing a lot of damage to society; "beyond a reasonable doubt" surely means that a higher standard is demanded. Some people (especially, some judges) think that the higher standard means certainty and that therefore probabilistic reasoning has no place in the courts. But ignoring probabilities is hardly the same as achieving certainty: it is simply a direct path to foolish decision making, such as that exemplified in the case of Sally Clark. Her case deserved a more serious treatment, including treatment of the relevant probabilistic facts. What actually happened was that an appeals court, despite being apprised of the probabilistic errors committed during the first trial, refused to overturn her conviction. Sally Clark was eventually found innocent after it came out that the prosecution had suppressed evidence showing that her second son died of natural causes. She subsequently died of alcohol poisoning.
Contrary to a widespread view in the legal community that statistical, and especially Bayesian, reasoning should not be considered in court proceedings, it is crucial in many cases that such reasoning be used — but, of course, used correctly. Many people find correct statistical reasoning difficult, but there are ways and means of improving it, some of which we will discuss in this blog. Meanwhile, if you are interested in Bayes and the Law, you might want to take a look at Norman Fenton's project.
† I thank Professor Philip Dawid for bringing this case to my attention and for helpful comments on it. His testimony to the appellate court on this case can be read here.
‡ This is so despite the widespread counselling of parents to the contrary and claims by various studies indicating no increased risk to siblings of SIDS victims! These studies all take pains to control for the kinds of risk factors I've identified above. What is relevant here is the increased risk of SIDS regardless of the cause (excepting those that Meadow actually did control for), and so the risk without controlling for alcohol, etc. is what is of interest. That risk, of course, is increased by the occurrence of a SIDS case in the family (observing an effect of a cause raises the probability of another effect being present!). The contrary claim, by the way, is probably put to parents as a means of reassurance; however, it could easily lead to complacency and to a neglect to deal with the risk factors in place in a family. In other words, made without qualification, the advice is both wrong and irresponsible.
Welcome to Bayesians Without Borders! In this blog we will demystify Bayesian technology and non-statistical Bayesian analysis. Topics will include: Bayesian networks of all varieties and all varieties of their application; Bayesian risk assessment; learning Bayesian networks; naive Bayes models; decision analysis and intelligent decision support; causal reasoning; Bayesian inference and argument analysis; Bayesian confirmation theory and philosophy of science. Applications we expect to deal with include: environmental management; biosecurity; bioinformatics; Bayes and the law. However, the applicability of Bayesian technology is limited only by the applicability of probability theory, so we will go well beyond these examples. (For the occasional technical post, we've added a notation page for ready reference.)
We will be inviting Bayesian researchers and analysts to post here on a semi-regular basis. In addition to your comments on posts, which are highly welcome (but will be moderated), we are open to suggestions about topics on which to post and about authors, and to unsolicited posts from you, which we will consider for publication. From time to time we may also post a precis or review of a relevant book.
Bayesians without Borders is meant to be a call to Bayesians everywhere to discuss, critique and inform the public of Bayesian ideas and methods.