Description
R-Instat is a free open-source statistics package. In this episode, Lily and David introduce R-Instat and discuss the motivations for creating it, based upon the need for a user-friendly, comprehensive tool for low-resource environments that can handle multilevel data efficiently. They consider its broader implications for data literacy and statistical education, emphasising the ongoing ambition to make complex analyses more accessible and intuitive.
For more information, see r-instat.org.
[00:00:00] Lily: Hello and welcome to the IDEMS podcast. I’m Lily Clements, a data scientist, and I’m with David Stern, a founding director of IDEMS.
Hi David.
[00:00:14] David: Hi Lily. We’re discussing R-Instat today.
[00:00:17] Lily: R-Instat! So R-Instat is this free open-source statistics package that predates IDEMS. You started working on it nearly 10 years ago; I can't believe it's been that long. And I joined a year after that, in 2017.
[00:00:32] David: 2016 I think we joined.
[00:00:34] Lily: It was 2016, you’re right. Yeah, time goes fast.
[00:00:37] David: Time flies, yeah.
[00:00:38] Lily: Of course.
But it's this front end to R, and it's a statistics package, so I guess it would be useful to outline initially: why do we need another statistics package, and why do we need that front end to R? There's a lot out there already.
[00:00:53] David: Absolutely. Before R-Instat came about, we didn't believe another one was needed. In fact, I would go to trainings and people would ask, which statistics package are you using? And I'd say, there's loads out there, there's loads out there.
And it was actually at one of those trainings, with a group we were supporting, that we went through and looked at all of them, and we found that none of them were fit for purpose for them. You have the commercial ones, and many of those we thought were very good, they could have been fit for purpose, but the group couldn't afford them.
And the open ones, for all sorts of different reasons, none of them actually matched their needs. Some were just too complex: they assumed you would learn more than you might want to, and keep track of things in ways that weren't appropriate for the relatively low-skilled audience we were trying to serve. Others were too niche: they did a few things, but you quickly fell off the end. There wasn't that comprehensive stats package we felt we could recommend, and that's what was needed.
And so, inspired by a colleague in Kenya, James Keleli Musyoka, we said, okay, we’ll try and get this built. We’ll try and get it built in Africa for you and other universities and these audiences, including the climatic audience that we were serving.
What was interesting, of course, is that quite quickly we found there was a whole other reason a new front end was needed: when we looked at the best of what was available for inspiration, including the commercial ones, we found they weren't using all the things that could be used right now. There were advances, particularly in relation to multilevel data, in thinking about how you could analyse data on a database rather than just a single rectangle of data.
[00:03:00] Lily: So let’s just go back a step, so what do you mean by multilevel data?
[00:03:04] David: If you have any real-world data right now, it is on multiple levels. Let's say you do a survey: you go and talk to a lot of people in different places, you go to the household and you interview people, and within the household you have data on the individuals.
[00:03:25] Lily: Yeah.
[00:03:25] David: You have data at the household level. You probably have data at the village level. You might have data at the county level, and if it's multi-country data, you might have data at the country level, and so on. So you almost always have data at multiple levels. With many of these big surveys which are open and accessible, you end up with at least six levels of data, just in that standard way of thinking about it.
[00:03:50] Lily: Sure, and so when you’re using this data you can then analyse it at these different…
[00:03:55] David: Absolutely, and people think about multilevel as being complex and hard because they automatically think of multilevel modelling. And that does get hard once you start thinking about the error terms being at different levels: variability at the individual level, the household level, and the village level. Now your modelling is much more complicated, because you have these random effects at different levels and so on. And yes, that modelling is more challenging than if you only considered variability at the individual level.
But multilevel is not just about the modelling. Multilevel is simply about the fact that data naturally lives at different levels. One of the examples I love is that pretty much any questionnaire has questions which should lead to multilevel data. Very simply, imagine you've got a simple questionnaire, you're just asking a set of people a set of questions, and one of the questions is: which of these newspapers do you read? How do you want to store that data? You now have multiple responses.
You can store that and make your data very wide, with a single column for each newspaper, but really the correct way to store that data is in three levels. You'd have the data at the individual level.
[00:05:22] Lily: And that individual level will say?
[00:05:24] David: That individual level data would be, let's say, the person answering the questionnaire, and it would have all the data from the other questions they answer.
[00:05:33] Lily: Right. And in there it would say and they read this set of newspapers.
[00:05:37] David: No, I mean the whole point is that the newspaper question doesn't belong at the person level; it belongs at the person-newspaper level.
[00:05:51] Lily: And so that’s another level. Okay, so we’ve got our individual level.
[00:05:54] David: Yeah, our person level.
[00:05:55] Lily: Yeah.
[00:05:56] David: Our newspaper level, where you'd have a little table with information about the different newspapers, even if it's only that these different newspapers exist. And then you'd have a table which refers to the person level and the newspaper level, and which says: this person reads this newspaper.
[00:06:14] Lily: Okay, yeah.
[00:06:15] David: And that's the efficient way of storing the data; it's the correct way to store it. You can now do useful things: whenever there's an 'other' response, you can add new entries to your newspaper table, and so on. There's all sorts of things you can do to make that data live in the right way if you think about it as three tables instead of one.
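[Editor's note: a minimal sketch in R of the three linked tables David describes, with made-up names and data. It illustrates the structure, not R-Instat's internal representation.]

```r
# Level 1: one row per person, holding everything answered at the person level.
person <- data.frame(
  person_id = 1:3,
  age       = c(34, 52, 29)  # a hypothetical person-level question
)

# Level 2: one row per newspaper, holding whatever is known about each paper.
newspaper <- data.frame(
  newspaper_id = 1:3,
  name         = c("Daily Mail", "Sun", "Guardian")
)

# Level 3: the link table, one row per (person, newspaper) pair,
# recording "this person reads this newspaper".
reads <- data.frame(
  person_id    = c(1, 1, 2),
  newspaper_id = c(1, 3, 2)
)

# Joining the three tables recovers a flat, wide view whenever one is needed.
merge(merge(reads, person, by = "person_id"), newspaper, by = "newspaper_id")
```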
[00:06:38] Lily: And so how does R-Instat facilitate this use of multilevel data?
[00:06:43] David: I would argue we're still in our infancy. But at the heart of R-Instat are structures in R which essentially act like a database, and which allow you to have this multilevel data and work on it in constructive ways.
A very concrete, practical example: we work a lot with climate data. Let's say you have climate data which is daily data: you have the rainfall on each day. And then you want to look at summaries of it. You might want a number of summaries. Let me be careful about what I'm saying: you could get the average rainfall per rainy day in a given year, you could get the total rainfall that fell in a given year, or you could get the total seasonal rainfall.
There are lots of different summary statistics that could be of interest, so you might want to get a few of those. You get your summaries, and you now have a new table where, instead of a row for each day at each station, you have a row for each year at each station.
[00:08:05] Lily: Yeah.
[00:08:06] David: And then, let's say you now want some more summaries. If you were doing this in code, you would take those extra summaries, get a new table, and then merge it with your other table.
[00:08:22] Lily: Sure.
[00:08:23] David: But if you've got the right linking, you should be able to get it to go straight into where the other summaries were. And that's exactly what we have: structures which enable this to happen. The data system, working at multiple levels, knows that if you take summaries from this data in this way, they will go into that table that already exists.
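[Editor's note: a sketch of what this looks like when done by hand in a script, using dplyr and invented station and column names. R-Instat's linked data structures aim to make the final join happen automatically.]

```r
library(dplyr)

# Hypothetical daily climate data: one row per station per day.
daily <- data.frame(
  station = "Station A",
  year    = rep(1981:1982, each = 4),
  rain    = c(0, 5.2, 12.0, 0.4, 3.1, 0, 8.4, 21.7),
  tmax    = c(30.1, 29.4, 28.8, 31.2, 31.0, 30.5, 29.9, 28.3)
)

# First pass: rainfall summaries, one row per station per year.
rain_by_year <- daily |>
  group_by(station, year) |>
  summarise(total_rain = sum(rain),
            rain_days  = sum(rain >= 0.85),  # assuming a 0.85 mm rainy-day threshold
            .groups = "drop")

# Second pass: temperature summaries at the same (station, year) level.
temp_by_year <- daily |>
  group_by(station, year) |>
  summarise(mean_tmax = mean(tmax), .groups = "drop")

# In a script you must do this merge yourself; with the right linking,
# the new summaries go straight into the existing yearly table.
yearly <- left_join(rain_by_year, temp_by_year, by = c("station", "year"))
```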
[00:08:47] Lily: Excellent. It’s like in Excel how you have multiple sheets.
[00:08:50] David: Yes.
[00:08:50] Lily: But when you create new summaries, or add to your summaries, and you're doing it by, say, station and year again, let's say you're now doing your temperature summaries, it adds them into the same dataset that your earlier summaries went into.
[00:09:04] David: Exactly. And it's a trivial thing, this is just automating the merge. But it is symptomatic of what's so important, because the fact that those tables are linked matters in other ways too. It means, for example, that if you want to do a graph, you can know which summaries you could actually use to populate it. Because sometimes you might want a graph where you look at the individual values and then put the average on, or some other summary.
And so now you know where you can get them from, and in theory these things can all be linked. Now, we've not taken this very far yet; it's only been ten years since we started building these ideas. But what we recognised fairly early on is that thinking about analysing data across levels changes the way the software should behave, what the software can do, and how you think about this.
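[Editor's note: one reason the linking matters is exactly this kind of plot, combining raw values with a summary from the linked table. A minimal ggplot2 sketch, reusing the hypothetical daily frame from the sketch above.]

```r
library(ggplot2)

# Individual daily rainfall values per year, with the yearly mean overlaid.
ggplot(daily, aes(x = factor(year), y = rain)) +
  geom_jitter(width = 0.1, alpha = 0.6) +    # the individual values
  stat_summary(fun = mean, geom = "point",   # the summary on top
               colour = "red", size = 3)
```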
And the reason this is so important is that if we can take this further, a lot of the challenges people face in analysing their data could be reduced. I'll give you one very specific example: the big household surveys done by National Statistics Offices. In low-resource environments, where people could use these surveys in really powerful ways, the data is in theory open, but it's not being used as much as it could or should be. One of the reasons is that the skill needed to get it ready for analysis is actually rather hard to come by.
And one of the reasons it's rather hard is that you often want data at different levels brought together, depending on your research questions. These questionnaires have become huge, and one of the reasons is that a lot of the questions, these choose-multiple questions where you can pick several answers, instead of being stored as we've discussed, become 20 or 30 columns.
[00:11:27] Lily: Kind of, your dummy variables saying yes or no.
[00:11:31] David: Exactly, yes.
[00:11:32] Lily: Daily Mail yes, no, Sun yes, no.
[00:11:35] David: Exactly, and so therefore you have these very wide datasets.
[00:11:38] Lily: Yeah.
[00:11:39] David: Now, if you think about that differently, all of those columns can and should be taken out, and replaced by a summary column which is linked to where the raw data is. What belongs at that level might be a summary, or it might be a list, a column containing a list; there are ways you could link that back to the raw data. And then those datasets, instead of having so many columns, would just have the same number of columns as questions: every question would give you one column in that dataset.
The raw data would be left in these linked tables, but you don't need to worry about it as much. So it's now much easier for people to navigate, because if they have the questionnaire, they have the same number of questions as they have columns.
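[Editor's note: a sketch of the reshaping David describes, with an invented three-newspaper extract. tidyr's pivot_longer turns the block of yes/no dummy columns into a linked person-newspaper table, leaving a single summary column behind at the person level.]

```r
library(dplyr)
library(tidyr)

# Hypothetical wide survey extract: one 0/1 column per newspaper.
survey <- data.frame(
  person_id  = 1:3,
  daily_mail = c(1, 0, 1),
  sun        = c(0, 1, 0),
  guardian   = c(1, 1, 0)
)

# Pull the dummy columns out into a person-newspaper link table...
reads_long <- survey |>
  pivot_longer(-person_id, names_to = "newspaper", values_to = "reads") |>
  filter(reads == 1) |>
  select(person_id, newspaper)

# ...and keep just one summary column at the person level.
survey_slim <- survey |>
  select(person_id) |>
  left_join(count(reads_long, person_id, name = "n_newspapers"),
            by = "person_id") |>
  mutate(n_newspapers = coalesce(n_newspapers, 0L))
```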
Details like that would make a huge difference in making it easier for people to navigate and work on this data. And having the levels linked to one another would mean you wouldn't need to take a question that was asked at the village level and turn it into a household variable. You could leave it at the village level, because that's where it belongs, and analyse with it there through linking.
I feel I'm getting a little lost, but the point is that if we could make it easier for people to tie the nature of the data to the way they're analysing it, it would be so much more accessible to a wider audience.
This is, I would argue, a statistics education point, and it's related to data science. Even when I think about some of the scandals we're hearing about from artificial intelligence, some of these relate, in different ways, to how you think about multilevel data, and to how thinking about data differently could make it so much more accessible to a wider audience.
And one of the hopes I have over the next few years, as we continue to develop these ideas and hopefully bring them into the software, is that the data literacy you need to engage with data could come right down; it could become so much more accessible.
There is complication in data. If you take summaries in different ways, just thinking about proportions: a proportion compared to what? What's a hundred percent? That's a hard question, and people aren't thinking about it enough, because it's not easy enough to get people to the stage where they can think about things which are genuinely hard.
To come back to our newspaper example: do you want the proportion of people who read a given newspaper? Do you want the proportion of newspaper readings which are of a given newspaper? There are so many different things you could ask! You want to know what the proportion of your newspaper is; what do you mean? A proportion of whom? If somebody doesn't read any newspapers, do they count? Or do you just want the people who read newspapers to be the hundred percent, so it's the proportion who choose to read that newspaper as opposed to other newspapers?
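[Editor's note: the denominator question made concrete, using the hypothetical person and reads tables from the first sketch. Each proportion below answers a different question.]

```r
library(dplyr)

n_people  <- nrow(person)                 # everyone surveyed
n_readers <- n_distinct(reads$person_id)  # only people who read something

reads |>
  count(newspaper_id) |>
  mutate(
    prop_of_people   = n / n_people,   # "what share of all respondents read it?"
    prop_of_readers  = n / n_readers,  # "what share of newspaper readers read it?"
    prop_of_readings = n / sum(n)      # "what share of all readings is it?"
  )
```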
There's so much complication to think through when we talk about literacy. And a lot of that is not being thought about widely enough at the moment, because the tools people have for interacting with data are essentially not supporting us to easily ask the right questions of our data.
[00:15:51] Lily: Sure, so what we have here is that R-Instat recognised that multilevel data is a lot more prominent now than it was many years ago, when the likes of SPSS and SAS were developed. And R-Instat is taking this into account. And actually, it seems like such a small thing, multilevel data.
[00:16:12] David: But it's always about the details. And the point is, if you're writing scripts it's not that important, because you can always take data from one level and push it to another level in a script. So in terms of the analysis, it's not that people aren't doing these analyses. It's just that people need a relatively high level of skill to do them routinely.
[00:16:35] Lily: Yeah. Whereas R-Instat is trying to make that easier.
[00:16:39] David: Exactly. And we're not there yet. Let's be clear: there's been some really good work in the back end which is going to make a lot of this possible, and there's more to be done. There's been a lot of work on individual dialogues. And this is the other thing I think is so interesting: some of the front ends to R have taken the route, which is very sensible, that dialogues correspond to functions. You create the language which corresponds to your dialogue, and they match onto one another.
And from a coding perspective, in many ways, for a front end system, this makes perfect sense. This makes your life so much easier.
[00:17:25] Lily: But then you look at things like Stata, where you can write your code or you can use the dialogues, and the dialogues themselves are fairly simple, but the menus become very overwhelming.
[00:17:38] David: And I think your example of Stata is a really good one, because that's one where they've done exactly that. And it works; the language is very good. But as you say, the dialogues are often very overwhelming, and they're not natural or intuitive. That observation, that you actually need different things when you're doing the dialogue and when you're doing the language, is one we made very early on.
So separating the design of a dialogue, from a user's perspective, from the design of the language that comes out, and what actually makes sense as a language, is so important. And so is being able to create a little script from your dialogue. My favourite example of that is ggplot within the tidyverse; it's such a beautiful language. Putting that into a format where you can actually create dialogues around it is tricky, though we do have a general dialogue which broadly maps onto it in a relatively simple way, I'd argue.
[00:18:56] Lily: Yeah.
[00:18:56] David: But we also have other dialogues which are not about using ggplot, but about: I just want a scatter plot, I just want a line plot, a bar chart, whatever it might be, yeah?
[00:19:13] Lily: Yeah.
[00:19:13] David: And sometimes that's all you want to think about, so you want to be able to do both. The design of the dialogue is separate from the code; different dialogues can produce the same code, and that was another deep insight. We even have, in the menu system, the same dialogues accessible from multiple places. These design features matter when you're thinking about the front end, and they're all part of the complexities that come in behind this. I suppose this is why it's taken us a long time to get to where we are: the project's become so much more ambitious.
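[Editor's note: an illustration of the point, not R-Instat's actual output: a specialised "scatter plot" dialogue and a general ggplot dialogue could both emit the same short ggplot2 script.]

```r
library(ggplot2)

# The same two lines could come from either dialogue.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
```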
We really believe now that this isn't important just for low-resource environments. There is something here which is needed across the board to get the sort of data literacy I want to see in the future; we need software which is going to enable that. And I don't see R-Instat instead of all the other options that are there, but I do see it as well as them. I can see it bridging into other options and being part of the mix, in a way which I hope we'll really see come to fruition over the next few years. It's getting closer than it was. But still, if you look at it, it looks out of date.
And that's all cosmetic in a sense, and it's something we'd love to work on eventually. But it does matter. You look at some of the other front ends, which are much smoother, and that's down to the technologies they're using. Our choice of technology was a really important part of making this accessible in low-resource environments.
But the price we have paid for it is that it's not aged as well as we'd like in other ways. So the plan is, of course, to get this available in web technologies, which have now advanced to the extent that we could create offline versions that can be installed in ways they couldn't be 10 years ago.
[00:21:25] Lily: Yeah, it sounds like there have been a lot of developments over the last 10 years: starting from software for use in low-resource environments, in education, and in good statistical practice, to then realising, oh, actually, this good statistical practice isn't just about education, right?
It's also about, okay, now we can change the structures, we can have this multilevel data in here, this linking in here, and about how you don't let the code decide the dialogues.
[00:21:52] David: Exactly, yes.
[00:21:53] Lily: You decide the dialogues. And in a way that says to me that with R-Instat, the R code is kind of secondary.
[00:22:02] David: No, it's not. This is what's so interesting. It's not just that you don't let the code define the dialogues; other people have done that, but then they let the dialogues define the code, and you get code which isn't right either. The key thing is that the separation we've got means you can have good code coming out of dialogues, and the two can look very different.
So you can have a dialogue which is very simple but ends up producing really complex code, and another dialogue which looks really complex, where the code that comes out is actually rather simple. You're not mapping one onto the other: the code that's created is independent of the nature of the dialogue creating it. And that's really powerful; it goes both ways. The R code is the source of truth, but the system we've built means the dialogues themselves can author R code and interact with R code. And there are some really powerful features on the horizon where this will feed back in different ways.
[00:23:12] David: I'm conscious that what we should actually do is have a follow-up session, so we should call it a day here, because we could carry on forever. We should have another episode where we dig into some of these underlying structures, into some of the R code behind it, and why it's thought of differently and what that will mean. Because I believe we will get to a new R language, a bit like how RStudio, through the tidyverse, has become one of the drivers of the R language, or a part of the R language.
I think what we've found from the developments in R-Instat is that there are other ways the language could be driven forward, and we're looking for that to really play out. So, another discussion on that.
[00:24:00] Lily: No, absolutely, and I very much look forward to that discussion as well.
Thank you very much, David.
[00:24:04] David: Thank you very much.