Description
In this episode, Lily and David consider foundational data skills in data science education. They discuss Lily’s recent teaching experience at the doctoral training school in Kigali, Rwanda, as part of the AIMS initiative. The conversation explores the significance of teaching basic yet essential data handling and analysis skills to data science students, emphasising how these foundational abilities are often overlooked in conventional education but are critical in the real-world application of data science and responsible AI.
[00:00:00] Lily: Hello and welcome to the IDEMS podcast. I’m Lily Clements, a data scientist, and I’m here today with David Stern, a founding director of IDEMS.
Hi, David.
[00:00:16] David: Hi Lily, you’ve just come back from an exciting trip to Rwanda, so I’m keen to discuss this.
[00:00:22] Lily: Yes, I’ve just come back from Rwanda, where I was teaching in Kigali, or doing some workshops in Kigali with various different doctoral students.
[00:00:33] David: This is part of the AIMS initiative, African Institute for Mathematical Sciences, and they have a doctoral training school in data science, which I was privileged to teach at for the last two years, and I couldn’t make it this year. And you and James stole my place, which I’m really delighted by.
[00:00:53] Lily: I’m very delighted too. It was, firstly, a really nice place, Kigali, and Rwanda as a whole.
[00:00:58] David: You’ve not been to Rwanda before?
[00:01:00] Lily: No.
[00:01:01] David: Oh, it’s wonderful, isn’t it? Rather different to the different places you’ve visited.
[00:01:05] Lily: I’m starting to see more and more, as I go to the different countries in Africa, just how much diversity there is across the different countries. But anyway, that is definitely for another.
[00:01:16] David: That’s another episode.
[00:01:17] Lily: Absolutely. But no, so as you say, I was teaching, or rather running a workshop, at the doctoral training school there, with roughly 30 students who have just started their PhDs.
[00:01:31] David: Yes and it’s always, I should be clear, AIMS is a fantastic initiative which I’ve been involved in for many years, or on the periphery of, and they do well at recruiting top students from across the continent, and there’s some really bright people in the room, I imagine.
[00:01:47] Lily: Yeah, really bright people, and a whole diverse set as well of individuals doing different, completely different areas related to artificial intelligence or data science.
[00:01:56] David: And some, if I’m not mistaken, are working in more statistical areas as well. So there’s this whole range.
[00:02:03] Lily: Yeah, absolutely, yeah. And so then I guess that leads quite nicely into what we were doing, which was introducing basic, or I say basic data skills. I say basic…
[00:02:15] David: No, I love that. I love that way of framing it, we’ll get into that in a minute, I should let you carry on.
[00:02:19] Lily: And these are kind of those data skills which, again, I say basic, but a lot of these skills I didn’t have after my PhD. Or I didn’t recognise them as really vital things by the end of my PhD in statistics. And I’ll describe them as the kind of more simple things as well; they’re not the high-level modelling.
[00:02:39] David: Yeah. It’s not high level modelling. It’s not, people often very much think of the more theoretical aspects, the things which actually, come down to the more mathematical elements of statistics and data science. And that’s what a lot of people have put a lot of focus on.
And as you say, you call these basic skills, but I know how much the students will have struggled with them, despite being top students from across the continent. Top students across Europe would struggle in similar ways, because there are not that many places which actually give people these sorts of core skills of working with data.
[00:03:18] Lily: Absolutely, and so we had these different students from different backgrounds, they had different levels of previous skills in using statistics and data, but they all, by the end of it, were at a very similar level in a way. Because, regardless of your background, it was okay, let’s go back to the basics.
[00:03:36] David: Sorry let’s actually help other people to follow what we’re discussing, because I know what you’re saying, because I’ve taught that course, I’ve experienced it. What did you do with them?
[00:03:47] Lily: Great, yeah, so we gave them a dataset. We had a simulated dataset, so a dataset which was generated using some code, and we gave them that dataset and we said, here’s a dataset. There were about 10 variables in it, including the kind of outcome variable. And we said to them tell us the model that made this.
[00:04:11] David: Exactly. Now I want to just hone in on that for a second, because real data analysis is always hard. It’s much harder than other people ever imagined, unless they’ve actually got their hands really dirty working with data properly. And this is one of the things we’ve discussed at length in other episodes about responsible AI, about how data lies, or that’s what we’ve often mentioned in different contexts, and how unless you have those skills to be able to work with data, it can mislead you.
So the point is that from a theoretical perspective, we’ve removed all the uncertainty. This is now really just a solvable problem. You can get down to a right answer because it was simulated. And so if you have the right data skills, the problem you’ve given them, which I understand quite well, because it’s the same problem that I gave them previously, this problem is a solvable problem. And that’s unusual and really important.
[00:05:14] Lily: Yeah, and I guess just to explain what we mean by simulated or generated data: if I think back to school, I remember working on patterns. You’re told that you have X, which is 1, 2, 3, 4, and Y is 2, 4, 6, 8. What is the pattern there? Well, Y is two lots of X. And that’s just a really simple example of simulated data. So say we just had two variables in our dataset, it might be that our outcome variable, our Y, which is our yield, was just double a different variable.
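As a toy illustration of that idea (a hypothetical sketch, not the actual code used to generate the workshop dataset), simulating data with a known hidden model takes only a few lines of Python:

```python
import random

def simulate(n, seed=42):
    """Toy simulated dataset: the outcome is exactly double a predictor."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [2 * xi for xi in x]  # the hidden "model" a student would try to recover
    return x, y

x, y = simulate(5)
print(all(abs(yi - 2 * xi) < 1e-9 for xi, yi in zip(x, y)))  # True
```

The student’s task is the inverse: given only `x` and `y`, recover the rule `y = 2 * x`.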
[00:05:45] David: Yes.
[00:05:46] Lily: But as it is, we had around ten of these with lots of things going on and lots of things to find.
[00:05:53] David: And everything that is in that particular dataset corresponds to things which are observed in real datasets that I’ve personally worked on. So actually, the simulated dataset is really quite representative, a slightly simplified version of what a real dataset corresponding to this could be.
[00:06:16] Lily: Absolutely, and I guess, the nice, the simple first example, the simple first trap. So there’s various traps set up, and one of the first traps…
[00:06:25] David: Not traps, they are just features.
[00:06:27] Lily: Okay, fine, features. They’re traps.
[00:06:31] David: They’re not traps, we’re not trying to catch anyone out, we’re just trying to help people recognize features you find in data.
[00:06:39] Lily: Sure, okay, then I’m very sorry because I’ve been calling them traps. Okay. Features. One of the features in there is that one option that the students have from the start is do you want to have the small data set, the medium data set, or the large data set? And the difference in these is just how much data is being generated.
As data science students, a lot of them opt for the large dataset, because they think, and fairly correctly assume, that more data means we have a better chance of finding this model in a more accurate way. There’s going to be less kind of noise or randomness in there.
[00:07:17] David: It’s not that there’s less randomness, it’s that by having more data the randomness becomes more observable, and you can understand the nature of the random element. The more data you have, the more it looks like it comes from a particular distribution, because you have enough data to identify it.
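A quick numerical sketch of David’s point (hypothetical numbers, Python standard library only): with more data, repeated samples agree with each other far more closely, so the underlying distribution becomes identifiable:

```python
import random
import statistics

rng = random.Random(0)

def spread_of_sample_means(n, reps=200):
    """How much the sample mean varies across repeated draws of size n."""
    means = [statistics.mean(rng.gauss(5, 2) for _ in range(n)) for _ in range(reps)]
    return statistics.stdev(means)

small, large = spread_of_sample_means(10), spread_of_sample_means(1000)
print(small > large)  # larger samples make the random element itself more observable
```

The randomness hasn’t gone away; it is just pinned down well enough to characterise.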
[00:07:36] Lily: Yes, yeah, that’s a good point. Yeah, thank you. But then, one of the first things that you’ll find is, okay, I’m actually getting completely different things to my neighbours, the people sat next to me, who are doing other analyses. Or you might find that, okay, I’m getting unequal group sizes here.
[00:07:56] David: Not to give anything away, but might there have been a problem with missing data in that dataset?
[00:08:01] Lily: Yes, there’s absolutely a problem with missing data in that dataset. And, I don’t want to give anything away, but it links through to real-world examples. There was a real-world example that I gave to them, for them to try and realise that link of what’s going on here: COVID cases were being counted, and they lost something like 16,000 data points as they were collating all of the data together. I don’t know if I’m allowed to say why, but they lost a bunch of data points because there was so much data.
[00:08:31] David: Yes, and so therefore there was a whole chunk of things that were missing, and in particular the data points that are lost, sometimes it’s not by chance which data points get lost. And I think this was exactly what they then realised. If you have a large amount of data and the data points that get lost are just randomly from your data set, it makes no difference. But if they share characteristics which then mean that they get lost, you’ve now inserted bias into your data.
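That distinction can be sketched in a few lines (a hypothetical example, loosely inspired by the story above: the lost rows all share a characteristic, here being the most recent ones):

```python
import statistics

# Hypothetical daily case counts that grow over time.
cases = [10 + 3 * day for day in range(100)]

ROW_LIMIT = 80  # e.g. a file-format cap silently truncates the table
kept = cases[:ROW_LIMIT]  # the dropped rows are not random: they are the latest days

print(statistics.mean(kept) < statistics.mean(cases))  # systematic underestimate: bias
```

Dropping 20 rows at random would barely change the mean; dropping the 20 rows that share a characteristic shifts it systematically.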
[00:09:01] Lily: Absolutely. Yeah. Yeah. And so the idea of how we taught this and the idea of that kind of week is here’s the data, you find a model, and there’s various features in it to help then find these different things that you can encounter in the real world.
And as you say, all of these features have come from things that you’ve experienced in the real world.
[00:09:24] David: Yeah, when analysing data with partners, where these were genuinely important. And there’s of course a real story behind all those features. Actually, the story behind the missing data is really nice, because the way it was generated had a limit, and so the algorithm just stopped. So this feature wasn’t actually deliberate; it wasn’t put in by design, it arose naturally from the way the data was simulated. The size of the data was just cut off when the simulation process stopped. And that led to this issue, which is so wonderful and comes out so interestingly.
[00:10:05] Lily: Yeah. We then had this real world example of where it’s happened to give to the students or to discuss with the students, of when it happened in the UK with the COVID cases, which made them realise, oh, okay, this happens, this is a real world thing that could happen.
[00:10:22] David: Yes, exactly. And I think this is something that you get in datasets and therefore that you have to be aware and looking for these things is so important.
Let me just ask you about it because I wasn’t there, so I know what’s happened in the past, but what happened in your case? So they started in that way, then what did they do? How did they get stuck in? Where were their next big learning points?
[00:10:45] Lily: So how they got stuck in was they were given the data. So we spent the first maybe 45 minutes explaining the data, explaining the problem. And then gave it to them and they just got stuck in and they could just explore it any way that they wanted to. Which then I guess leads to the next learning points because when exploring the data was when they came across various different features of it that were in there, such as, okay, I’ve got these different group sizes here from that missing data.
And then also, a lot of them wanted to jump straight to modelling. And so, great, that looks like an interesting model, so what’s going on? And then realising they need to take that step back, to actually look at the data, to be able to interpret their model.
[00:11:28] David: And this is really important, because actually quite a lot of them were really experienced: they’d worked with data, they’d worked with models. But quite often they might have built models for other people and not had to do the interpretation themselves. So getting that learning out of the model, understanding and interpreting the outputs to be able to actually draw out and say, okay, this is what the model tells me. But what does that actually mean about the model that generated the data? That was a whole other step in the process.
[00:11:57] Lily: Yeah, absolutely. And that was very fun to watch. It was a very enjoyable week, as I say. And yeah, another really interesting one was where some people did box plots and they were like, okay, that looks normal. Great. But what they don’t see in there is that it’s actually bimodal data. So instead of normally distributed where you have one bit of a high frequency, it’s more bimodal where you have, how do I explain bimodal data?
[00:12:21] David: Where you have two peaks.
[00:12:22] Lily: Yeah.
[00:12:22] David: Your data is really grouped into two, where there’s actually two different things going on, bimodal, two modes.
[00:12:30] Lily: Sure yeah, that’s a good way to put it.
[00:12:33] David: And I think one of the things which is really interesting for me to know is what software did they use? Because this is a free choice.
[00:12:40] Lily: Yep. And we had a whole variety of it. So we had someone that was using Excel at the start. So they were looking at pivot tables and graphs at the start, but then as they were going on, they couldn’t get those higher order graphics that you might want to get with other software such as R or Python.
[00:12:56] David: But did they make progress the fastest at the beginning? Because that’s what I often find.
[00:13:00] Lily: They were the fastest to work out the thing about the dataset size. They worked it out very quickly, within the first hour, I’d say, whereas some students were still on that much later; we’d thought we’d have to let them know by the end of the first day.
[00:13:18] David: But no, great. I’ve always found when teaching in this sort of way, that learning across students of, some people come in and say, oh, I use this, I use that as if this is better than that. Whereas then through this exercise, almost always there is an appreciation which comes where, oh, you used that and you were able to do that really, I could learn from you.
[00:13:39] Lily: Yeah.
[00:13:40] David: And so on. And there’s that mutual learning, which often happens where suddenly they appreciate each other’s approaches much more because there’s always different bits which different people do better.
[00:13:50] Lily: Yeah, absolutely. And it’s not that one person has fewer skills than another, so the person with fewer skills is learning from the person with more skills, say. It’s actually that people have these completely different skill sets, because they’ve got different backgrounds. And often, I think, going back, and this is what I mean by going back to those basics, as I said earlier: just looking at the data, or doing things like pivot tables as they did at the start, or simply scrolling to the bottom of your dataset to see if all the data’s there. Stuff like that, the real basic things, but the people with more of a background, the people that have worked with data a lot more, overlooked those stages a lot more.
[00:14:29] David: Are you sure it’s the people who had worked with data more or the people who had been coding more?
[00:14:35] Lily: Ah.
[00:14:35] David: I often find that actually people who have really worked with data, they do that. They have those instincts. But people who have been using code to be able to model data don’t. So it depends whether their background is really working with data or having studied models and modelling and so on.
[00:14:58] Lily: Well, that’s a confounding factor in the case that we had. That’s not something that I picked up on, but it’s definitely something I can believe. Something I’ll look out for next time.
[00:15:09] David: Yeah. You had people in Excel, or somebody started in Excel, what else did people start with?
[00:15:15] Lily: Yeah, a lot of people were using R, or Jupyter, or Python, sorry, Python in Jupyter Notebooks. That’s what most people were using, and we had some people start to use R-Instat, which is more of a front end. It’s a free, open-source front end to R, where you don’t have to worry as much about the code, but can instead look at the data and the statistical processes a bit more.
[00:15:37] David: Great. And any other softwares creep in there? Or were they the main ones?
[00:15:43] Lily: They were the main ones. If there were any others they’ve slipped my mind. I don’t think that there was Stata or Minitab or any of them.
[00:15:49] David: Interesting. I’ve had those in previous years where students have had them on their machines and actually get started in them. And that’s always interesting as well. But it is quite common that actually it’s the open source software which really dominates in the African context. This is why R and Python is so important.
[00:16:05] Lily: Yeah, yeah. When we gave them the task of, okay, find this model, you’ve got a week, they all thought, or a lot of the students thought, okay, yeah, I’ll do that, you’ll see, I’ll get that. And then after a few hours, James, who I was there with, would say to them, okay, has anyone got the model yet? And they all laughed, because the more they looked at it, the more they realised, okay, there’s a little bit more I need to do here than I thought initially. I think it was one of those problems where the more they did, the further away from the answer they realised they were.
[00:16:42] David: I have to confess, when I first taught this course, in this way I had three problems lined up in case some people got to the end of the first one rather quickly. I had two others lined up ready to go, which of course they never touched within the week.
[00:16:59] Lily: That doesn’t surprise me. They got really stuck into it, though, and that was what was really nice: how much they were really looking at it, and the different techniques they were using. And alongside us, other courses were being given, so we tried to tie it into those courses as much as we could.
[00:17:19] David: Yeah, as you say, it’s a really, I always find it a bit of a nerve wracking course to give, because you never know what’s going to happen, but it’s really enjoyable to see how people interact with it.
[00:17:30] Lily: Yeah, and enjoyable because it’s not always going to be the same course every time, because the students are the ones guiding it. I think it’s called student based learning. It’s a kind of technical term for it, I think. But yeah, the students are the ones that are guiding it. They’re the ones that are saying, okay, I found this and I found that. And then from there, you kind of get them to show other people.
[00:17:50] David: Yeah, and then you can dig into actually what are you observing about that? The one that I’ve always found, which you may or may not have found, was sort of ANOVA tables, where suddenly somebody does it, and they pick something up from an ANOVA table, and then everybody realized that they knew about ANOVA tables, but they never really understood them, or how to use them, or what they could learn from them.
[00:18:10] Lily: At one point the plan was, me and James, who I was giving the course with, spoke one of the evenings after teaching, and we were saying, okay, maybe tomorrow we’ll try and do a bit on ANOVA tables, or we’ll see if people are doing bits on ANOVA tables, and we can stand up and explain them.
And that never happened, because the students were just at different levels; some of them were not anywhere near there with the ANOVA tables. So with the students that did do ANOVA tables, we sat down and discussed it with them, and a lot of them were asking for extra resources afterwards.
We’re very lucky that we already had a lot of resources that we’d made for previous projects. But absolutely, there was no student there that understood the ANOVA table completely. They would all look at it, point at the p-values, and go, there you go, this must be it. Yeah, okay, but what does this say? How does this ANOVA table link to your graph, or the descriptives that you found?
[00:19:13] David: Yes, and that link between them that, oh, it’s so nice to see those links, those mental links being made. But it really worries me that these are really bright students who are doing amazing work, their PhDs are on really inspiring projects, and yet some of these basic skills of understanding data are not there.
This is something where I found that surprising, and you said it for yourself: you had a very good PhD in statistics, and yet even with your education there were elements of this that surprised you when you were first approached with this problem.
[00:19:53] Lily: You first gave me this problem.
[00:19:55] David: I know.
[00:19:57] Lily: And yeah, no, absolutely. And again, also just working on various courses and writing various courses, coming to realise those links between different things, and how actually, maybe I’m going to shoot myself in the foot here, but statistics isn’t that hard.
[00:20:18] David: I really like the way you phrase that. And I absolutely agree in some sense that there’s elements of how statistics is taught where people think of it as being overwhelming and very difficult. But a lot of the basic concepts, it’s just that they haven’t assimilated what they actually mean and what they’re actually saying at interpretation. And so it seems a lot harder than it sometimes is.
[00:20:40] Lily: And I go through waves of these feelings. Some days I’m like, there’s not much statistics, really, is there? And then other days I’m like, there’s so much statistics.
[00:20:49] David: I guess on that point, I’m afraid I’d lean towards the latter. There is a lot.
[00:20:55] Lily: There is a lot.
[00:20:55] David: But the basics, and this is where we come back to the basics: if you have the basics, you’re 99 percent there. It’s only 1 percent of the time that you need all the advanced stuff, where all the work is happening. But most of the time, you get most of the information from really good basics.
[00:21:19] Lily: Absolutely.
[00:21:20] David: I must say, both of us are mainly talking about basics, maybe as being bigger than most people would consider basics.
[00:21:27] Lily: Sure.
[00:21:28] David: There’s quite a big amount of that, but it’s still, it’s not that much, and that basic stuff is exactly, as you say, there’s not as much to it as some people think.
[00:21:36] Lily: And, by basics, a lot of it is stuff that you do at primary school: you learn the mean, the median, the mode, I remember learning them, and different graphs. Okay, I don’t think you do box plots at primary school, but the fun stuff, the much more fun stuff, of getting to visualise it, getting to see it.
[00:21:56] David: Exactly. But I think a lot of these things, if you take those basic tools, you take the idea of the box plot, and exactly as you say, that this can hide things. All of the things, all of the summaries you’ve mentioned in the box plots, these can hide things.
In R you can just swap out the box plot for a violin plot, which is a density plot. And then suddenly your bimodal distribution becomes visible. And then actually understanding, just increasing that toolbox a little bit, why you might want to do that, and how that might depend on how much data you have, and so on.
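The episode’s R example isn’t shown, but the same idea can be demonstrated numerically (a hypothetical bimodal sample; in practice you’d draw a density or violin plot, e.g. matplotlib’s `violinplot`, to see it directly):

```python
import random

rng = random.Random(1)
# Two groups mixed into one column, e.g. yields from two different varieties.
sample = [rng.gauss(10, 1) for _ in range(500)] + [rng.gauss(20, 1) for _ in range(500)]

# The median a box plot draws falls in the sparse region between the two peaks.
median = sorted(sample)[len(sample) // 2]

def count_near(centre, width=1.0):
    """Crude density check: how many points fall within `width` of `centre`."""
    return sum(1 for v in sample if abs(v - centre) < width)

print(count_near(10) > count_near(median) and count_near(20) > count_near(median))  # True
```

The five-number summary a box plot shows looks perfectly ordinary here; only a density view reveals the two modes.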
Those are sort of things where there’s instincts which then get built in. And I would argue that the education system we have isn’t great at building those data skills. And that’s maybe where we want to finish off just discussing these ideas about data skills.
[00:22:43] Lily: Yeah, and to bring it back around to, I guess, responsible AI, and how this is relevant today. Getting those data skills right is what you need to be able to look at data. As you say, 99 percent of it is those basic, not basic, but…
[00:23:05] David: Yeah, what we would consider relatively basic data skills, where I would argue it is impossible to be responsible if you don’t know them. This is the key thing, because with those basic skills, you won’t get misled. Yes, you might not have the skills, for a very specific study, to find the right way to make sure you’re confident in the results you’re getting. There’s really deep statistics which is sometimes needed because of the complications that come into particular studies in different ways.
I mean, you did your own PhD related to missing values in specific ways, so you understand that extremely well. But it’s not needed most of the time. When it is needed, it adds value. I’d put that in the 1 percent.
[00:23:55] Lily: Yes, okay. And we can’t all know that 1%?
[00:23:58] David: No. And we don’t all need to know that one percent. That’s why we get experts like you. And you don’t need to be an expert on everything in the one percent, but you’re an expert on something in the one percent. And that’s the key point. There’s actually that one area where you’ve gone more deeply, you are then an expert.
And so if you can do most of it, then, you know, for most other people, when they need to go beyond that, then they should look for an expert. And yes, you do get people who get expert at quite a lot in that 1 percent. That’s where, as you said, suddenly there’s a lot there to learn and there’s a lot of different things.
[00:24:31] Lily: Yeah, and going back a step to what’s important about these data skills. I suppose it links with responsible AI, where actually a lot of things have gone wrong, and a lot of scandals have come out, because the basic skills weren’t there. In automated loan approvals, there’s been bias because the process discriminates against minority borrowers: it’s considering factors which might correlate with something else, such as race or educational background, and that leads to unequal treatment. And that comes from not looking at the data properly at the start.
[00:25:11] David: Yeah. And not considering the structure of the data in the right way.
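A minimal sketch of how that kind of bias arises (entirely hypothetical data and variable names): a loan rule that never mentions group membership can still treat groups unequally when it relies on a feature that correlates with the group:

```python
import random

rng = random.Random(7)

# Hypothetical applicants: a "neutral" score that correlates with group membership.
applicants = []
for _ in range(1000):
    group = rng.choice(["A", "B"])
    score = rng.gauss(0.7 if group == "A" else 0.3, 0.1)  # a proxy, not creditworthiness
    applicants.append((group, score))

def approved(score):
    return score > 0.5  # the rule itself never looks at the group

def approval_rate(g):
    members = [s for grp, s in applicants if grp == g]
    return sum(approved(s) for s in members) / len(members)

print(approval_rate("A") > approval_rate("B"))  # disparate outcomes via a correlated proxy
```

Looking at the data properly at the start, e.g. checking how each input is distributed across groups, is exactly what surfaces a proxy like this before it is baked into a model.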
[00:25:16] Artificial Voice: Here’s what I found.
[00:25:17] Lily: Oh good, my watch is talking to me.
[00:25:20] David: That’s just the AI trying to come in on the conversation. It heard we were talking about it.
[00:25:23] Lily: It actually was, yes. Yeah, the watch heard us talking and it was like, oh, I can add into this conversation.
And so it’s surprising that these data scientists, who are starting their PhDs, don’t have, or don’t know, these basic things. But then I didn’t either at the end of my PhD.
[00:25:51] David: I would argue, more than that, what I think, rather than saying it’s surprising, what we should be saying is, if we want to be working towards a society where AI is used responsibly, we need to make sure that these skills aren’t coming in for a few people when they’re starting their PhD. They’re coming in really widely to data scientists all over the world in all sorts of different ways. And that’s something where, I know the Turing Institute has been doing work and we’ve worked with them on this a little bit, but there are other groups trying to help with this, but it is a really challenging issue.
And it’s something where it’s not going to be game changing on its own, but unless people actually start with this, getting to responsible AI is just not going to be possible. And this is the situation I would argue we’re currently in: we haven’t got enough effort going into building basic data skills into lower levels of education, particularly for data scientists, but I’d argue more generally for everyone, so that as a society we can be approaching these things more responsibly.
I guess that’s a rather sombre way to finish this episode.
[00:27:06] Lily: You like to end on a positive.
[00:27:08] David: I normally like to end on a positive. I think that there’s a real opportunity.
[00:27:13] Lily: There we go.
[00:27:15] David: To try and get these skills out really widely, to integrate them into education to all sorts of different levels.
[00:27:21] Lily: And you’ve said before to me that places are doing that, you’ve said a few times to me that in New Zealand, you’re really impressed with their educational structure when it comes to statistics and how they’re teaching these.
[00:27:32] David: And what’s really interesting, of course, is that they are teaching data skills more than statistics, but they’re doing it from the first year of primary right the way through secondary. They’ve been integrating this since about ’95, so they’ve now got almost 30 years of experience trying to do this. In that experience, they’ve found problems with what they were doing, and they are now improving it, and hopefully getting towards real learning that others could draw on.
New Zealand, I would argue is somewhere where we could go, just as people used to go off to Finland to look at the schooling systems, we should go off to New Zealand, we should have people going off to New Zealand to try and understand what they’re trying to do integrating data skills into schools from an early age.
And that is exciting. But that’s a long-term project. I would argue I’m more excited by the fact that you, in one week, were able to move the dial for these students, and it wasn’t even a full week, because they were doing other courses at the same time.
[00:28:34] Lily: Yeah.
[00:28:34] David: And really it’s not a huge intervention which is needed. But it is this element that at scale we should be trying to get these sorts of problem solving skills, thinking about these experiences out to people. And so it’s not just us trying to do this, there’s plenty of other people out there. But actually, aiming to get data skills much more widely available, this is something I’d love to see, and I think there are groups, the International Statistics Institute is interested in this. It’s a hard problem, don’t get me wrong, but I think it’s something where there is an appetite in certain circles to try and get these skills out more widely.
[00:29:19] Lily: Do you have anything else you want to say before we finish?
[00:29:22] David: I want to finish on an optimistic note, so I better not say anything more.
[00:29:27] Lily: Thank you very much, David.
[00:29:29] David: Thanks.