We employ data scientists to collect data from disparate sources, wrangle, and cleanse it into a meaningful shape. They have the skills to statistically analyse the data and produce coherent results.
So, what’s the problem?
Firstly, even if you do have access to someone with this very rare and unique set of skills, their time will be in high demand. But there is a much more significant problem. The most important information about an organisation rarely sits with the data expert. The knowledge sits with the domain experts. It is this knowledge that must be used to solve the most significant problems. Statistical learning should never be used to replace the domain expert; instead, it should be used to support them. pi Analytics can be used to automatically search through your data and show you instantly what the most important characteristics are.
Putting it into practice
Let’s start with a simple example. You are presented with two mushrooms. Which one would you eat? Would you eat the mushroom?
Using Business Intelligence to make the decision
This could be a business decision. Making the wrong choice will be harmful. Without any business intelligence, we must make our decision based on gut feel. Maybe we would choose to eat the brown mushroom. But why? It’s likely because we have some preconceptions. Maybe red is associated with danger – or is it that movies vilify red mushrooms in the same way they do wolves and sharks? Let’s build some traditional business intelligence and create a nice chart.
Now we can see that 58% of red mushrooms are poisonous, compared to 45% of brown mushrooms. Armed with this information, do we feel comfortable about deciding which mushroom to eat? In this case the information we have just seen probably makes the decision harder. This is a good thing because, if we went with our original gut feeling and ate the brown mushroom, 45% of us would be on our way to the emergency room. We can examine the data set further. Let’s look at some other mushroom characteristics and change the chart to look at cap shape.
Now we can see that ‘bell’ and ‘knobbed’ look significant. If the mushroom has a bell shape, there is an 89% chance that it will be edible. This is great news, but what if neither mushroom has a bell-shaped cap? The total number of mushrooms in our data set is 8,124 and only 452 of them are bell-shaped, which is under 6% of the data set. This is going to be a problem. If we find a significant characteristic, but it doesn’t occur often, then it’s not going to be helpful. I can keep searching and build more charts and tables. I may continue to find interesting characteristics. However, I’m not sure this will help me easily decide which mushroom to eat. In fact, the more I search, the more confused I get.
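The two checks described above, how strongly a characteristic value is associated with the outcome and how often that value occurs, can be computed side by side. The following is a minimal Python sketch, not part of pi Analytics; the function name `summarise` and the six sample rows are invented for illustration only. The final lines use the figures quoted above (452 bell-shaped mushrooms out of 8,124).

```python
from collections import defaultdict

def summarise(rows, characteristic, objective, positive):
    """For each value of a characteristic, return (count, coverage, positive rate)."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        value = row[characteristic]
        counts[value] += 1
        if row[objective] == positive:
            positives[value] += 1
    total = len(rows)
    return {
        value: (n, n / total, positives[value] / n)
        for value, n in counts.items()
    }

# Invented rows, for illustration only (not taken from the mushroom data set).
toy_rows = [
    {"cap_shape": "bell", "class": "edible"},
    {"cap_shape": "bell", "class": "edible"},
    {"cap_shape": "flat", "class": "poisonous"},
    {"cap_shape": "flat", "class": "edible"},
    {"cap_shape": "flat", "class": "poisonous"},
    {"cap_shape": "flat", "class": "poisonous"},
]

summary = summarise(toy_rows, "cap_shape", "class", "edible")
for value, (n, coverage, rate) in summary.items():
    print(f"{value}: n={n}, coverage={coverage:.0%}, edible={rate:.0%}")

# Coverage of bell-shaped caps, using the figures quoted in the text.
bell_coverage = 452 / 8124
print(f"bell-shaped caps cover {bell_coverage:.1%} of the data set")
```

A value with a striking edible rate but low coverage, like the bell-shaped cap, will rarely help with the mushroom actually in front of you.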
It is hard to determine:
- What the most important characteristics are.
- Whether they apply to a wide range of mushrooms – i.e. will I be able to apply them to the mushroom I have to make the decision about?
- How all the different characteristics interact; is it a combination of characteristics that will provide the best solution?
There is a solution: machine learning.
How does machine learning help us continually improve?
Using pi Analytics, we can quickly create an analytical chart. We can bring together all the characteristics from the data we have. To do this, you first select the Analytics chart type. This is only available with the pi Analytics module. Then you start adding the characteristics. The first item must be a unique identifier, which is something that identifies each row in the data. In this data set it is the mushroom ID. Then I add each characteristic I would like to test. In this example, I add every characteristic that has been collected. So here we have 22 different mushroom characteristics.
Obviously, in your data these would be your own characteristics, so if I were measuring student performance then they might be characteristics like age, gender, late marks, etc. Now we need to add and mark an objective. The objective is what we would like to make the judgement on. Here, it’s whether the mushroom is poisonous or not. In your data set, this is what you want to measure. This can be text or a value. If it is a value, the model will try to separate large and small.
Now all we need to do is push the build button. You will then be presented with a decision tree, drawn as a Sankey diagram. The colour of the segments and the lines represents the percentage of the objective values falling into each segment. The width of the lines represents the proportion of the sample that falls into each segment. We are looking for splits that cover a large share of the population and show a large difference in the objective.
Interpreting an analytical model
The top grey node shows us the overall population. It states there are 8,124 mushrooms in the data set. Overall, 52% of those are edible. The first node states that the most significant characteristic is odour. The left branch shows that 0% of 3,796 mushrooms are edible. The right branch shows that 97% of 4,328 mushrooms are edible. This instantly shows us that if we pick up and smell the mushroom, we can jump to a 97% chance of eating an edible mushroom. If we look at the detail, the mushrooms that are edible smell of almond or anise, or have no smell. Poisonous mushrooms smell fishy, foul, musty, pungent, spicy, or of creosote.
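The figures in this first split can be checked with a little arithmetic, and they also show why a tree-building algorithm would favour odour. The Python sketch below makes one loud assumption: it scores the split with Gini impurity, a common criterion for decision trees. The criterion pi Analytics actually uses is not stated here. The counts and percentages are the rounded values quoted above.

```python
def gini(p):
    """Gini impurity of a two-class node where a fraction p is edible."""
    return 2 * p * (1 - p)

# Rounded figures quoted in the text.
total, overall_rate = 8124, 0.52
left_n, left_rate = 3796, 0.00    # branch where no mushrooms are edible
right_n, right_rate = 4328, 0.97  # almond, anise, or no smell

# The two branches should account for every mushroom.
assert left_n + right_n == total

# The weighted average of the branch rates should reproduce the overall rate.
recombined = (left_n * left_rate + right_n * right_rate) / total
print(f"recombined edible rate: {recombined:.0%}")

# Impurity before and after the split (Gini is an assumption, not pi Analytics' stated method).
before = gini(overall_rate)
after = (left_n * gini(left_rate) + right_n * gini(right_rate)) / total
print(f"impurity before: {before:.3f}, after: {after:.3f}")
```

The recombined rate comes out at roughly 52%, matching the top node, and the impurity falls sharply after the split. That sharp fall is the sense in which odour separates the two classes well.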
So, if the mushroom smells OK, then eat it. Evolution has provided us with a very comprehensive sense of smell. We would not have survived to this point had we not been able to smell things and decide whether to eat them. Let’s explore the model further and follow the right branch down.
The next characteristic is spore print colour. We can now improve our odds of eating an edible mushroom to 99% by avoiding the mushrooms with a green spore print. So now we have a model: smell the mushroom and check the spore print. What is interesting is that the model is not telling us to look at the colour of the mushroom’s cap.
Armed with all this information, would you still eat the mushroom?
But… and this is a big but… This is a model, and all models are wrong! (George Box https://en.wikipedia.org/wiki/All_models_are_wrong)
So I’ll pose the same question now as I did in the beginning. Would you eat the mushroom? It probably depends how hungry you are. What matters here is whether the model is useful. I’d argue that it’s quickly done a couple of very important things:
- It has challenged our preconception. We no longer look at cap colour.
- It has given us a model that we can use.
Combining domain expertise with machine learning
There is something very special about pi Analytics. It brings together the power of machine learning and the people who hold the domain knowledge. However, one reason we are still not ready to make a decision about eating mushrooms is that we still don’t know very much about them. If we tap into domain knowledge, then something exciting happens. The model may not be helpful in itself, but if we show it to a mushroom expert – or get them to build it for themselves – they will tell us very quickly how useful the model is, why it’s right, or why it’s wrong.
They may tell us why our sample is no good – maybe our data comes from America and we are trying to eat European mushrooms. They can quickly tell us things about the model which data scientists struggle to tell us. Statistics tells us about correlation but not causation. It cannot tell us why this is happening, but the domain expert can. We built one model; what if the expert knows that smell is just not useful, because people find it very difficult to identify a smell accurately? pi Analytics lets you quickly remove a characteristic (or add one). Let’s remove odour and rebuild the model.
pi Analytics now completely rebuilds the model; all remaining characteristics are re-evaluated. Let’s look through the tree. Spore print colour is still highly significant, but now the second criterion on the right branch is gill size. So, 3,392 edible mushrooms have a spore print colour that is either black, brown, buff, orange, purple, or yellow and a gill size that is broad. We can keep exploring the data set and add and remove columns, using our domain knowledge to find a model which is useful.
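Removing a characteristic and rebuilding amounts to re-scoring whatever candidates remain. The Python sketch below illustrates the idea on a handful of invented rows. The helper names `split_score` and `best_split`, the Gini-based score, and the toy data are all assumptions for illustration and do not describe how pi Analytics works internally. For simplicity it splits on every distinct value rather than grouping values into two branches as the tree above does.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(rows, characteristic, objective):
    """Impurity reduction from splitting rows on every value of a characteristic."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[characteristic]].append(row[objective])
    parent = gini([row[objective] for row in rows])
    children = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return parent - children

def best_split(rows, characteristics, objective):
    """Return the characteristic with the largest impurity reduction."""
    return max(characteristics, key=lambda c: split_score(rows, c, objective))

# Invented rows, for illustration only (not taken from the mushroom data set).
toy_rows = [
    {"odour": "none", "spore_print": "brown", "gill_size": "broad", "class": "edible"},
    {"odour": "none", "spore_print": "brown", "gill_size": "broad", "class": "edible"},
    {"odour": "almond", "spore_print": "black", "gill_size": "narrow", "class": "edible"},
    {"odour": "foul", "spore_print": "green", "gill_size": "narrow", "class": "poisonous"},
    {"odour": "foul", "spore_print": "brown", "gill_size": "narrow", "class": "poisonous"},
    {"odour": "musty", "spore_print": "green", "gill_size": "broad", "class": "poisonous"},
]

characteristics = ["odour", "spore_print", "gill_size"]
first = best_split(toy_rows, characteristics, "class")

# Drop a characteristic and re-evaluate everything that remains.
remaining = [c for c in characteristics if c != "odour"]
rebuilt = best_split(toy_rows, remaining, "class")
print(first, "->", rebuilt)
```

Because every remaining characteristic is scored again from scratch, the rebuilt tree can look quite different from the original, which is what makes this a quick way to test an expert’s objection.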
So, why mushrooms?
So why did I choose mushrooms for this example? Well, precisely because I can once again ask you the question: Would you eat the mushroom?
PLEASE DO NOT USE THIS MODEL. DO NOT GO OUT INTO THE WOODS THIS WEEKEND AND SNIFF AND EAT A MUSHROOM. YOU WOULD PROBABLY BE VERY ILL OR WORSE.
If you do know a mushroom expert, then ask them what this means – if they want to explore the data then please get in touch.
So where does this leave you?
You have data, you have characteristics, you have objectives, and most importantly you have the domain knowledge. Let me ask you some questions.
- What is the probability that a student will achieve their potential grades?
- What is the probability that a patient will attend an appointment?
- What is the probability that an engineer will fix the problem in one visit?
- What is the probability that your heating will fail?
- What is the probability that a customer will purchase from you?
- What is the probability that a bug fix will fail at the regression test stage?
- What is the probability that a customer will crash a hire car?
- What is the probability that a customer will renew their gym membership?
- What is the probability that a patient’s x-ray will enable diagnosis?
- What is the probability that a prospect will respond to a marketing campaign?
These are all questions that we are starting to answer.
What could you understand in your data?