086 – The R-Instat Calculation System

The IDEMS Podcast

Description

Lily and David discuss the evolution of R-Instat, an open-source graphical front end to the R statistical language. They explore the development of the back-end calculation system, which aims to simplify complex data analysis and enhance reproducible research. David explains how R-Instat addresses the challenges of multi-level data and automates calculations, making it accessible to users who may not have coding expertise.

[00:00:00] Lily: Hello and welcome to the IDEMS podcast. I’m Lily Clements, a Data Scientist, and I’m here today with David Stern, a founding director of IDEMS. Hi David.

[00:00:08] David: Hi Lily. What are we discussing today?

[00:00:11] Lily: I thought today we could do another on R-Instat, and this time delve into the calculation system.

[00:00:17] David: Absolutely. Let’s delve into that.

[00:00:20] Lily: So R-Instat, just as a starting point, is this kind of front end to R that we’ve been working on developing for, it must be coming up to ten years.

[00:00:30] David: It’s eight.

[00:00:31] Lily: Eight years. Okay. And it was meant to start as this kind of one-year project. But it’s just grown and grown, as my understanding is anyway; as more needs have come, or as the kind of potential of it has been realised, it started growing a lot more. And it’s just this free, open-source statistics software, which uses R in the background to run the code.

[00:00:53] David: The original conception was a combination building on an old statistics software package called Instat, which was all about good statistical practice and making it easy for people to use data well, really as a teaching tool, and also for specific analyses in climate data and other specific contexts.

But now, with R being the powerful open source language it is, you don’t need a whole stats package. You just need a front end that allows you to access all that power in R. And so R-Instat was developed, and the idea, as you said, was to get something useful within one year.

And arguably that was achieved. But one of the big things that happened in that year, and this is where the calculation system came out, is this recognition that, actually, the analytic packages that are out there, what already exists, are not really serving the needs that are currently there. And so actually this much more ambitious project emerged and the calculation system is part of this.

And I think in the previous episode we mentioned this sort of multi-level data as being one of these big, deep insights: that stats packages, and analytic tools generally, are designed to work on tables of data, rectangles of data, whereas almost all data which is useful nowadays is multi-levelled, like a database, multiple tables which are linked.

[00:02:36] Lily: For example, you might have data on schooling: it could be students’ results within a classroom, or results by classroom, but you could also look at it between different schools, between different districts. So we’ve got these different levels: you’ve got your classroom, you’ve got your schools, you’ve got your districts.

[00:02:55] David: And there are things which differ in each district. And the schools also have differences. And so, to understand or compare differences between schools and differences between districts, there is information at multiple levels. And you can always squish that down to the level at which you’re wanting to work, or raise things up. But the data really lives at a given level. And most data we work with is multi-level.
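
To make the idea of multi-level, linked data concrete, here is a minimal sketch in base R; the tables and column names are invented for illustration.

```r
# Two levels of the same dataset, linked by school_id.
schools <- data.frame(
  school_id = c(1, 2),
  district  = c("North", "South")
)
students <- data.frame(
  student_id = 1:4,
  school_id  = c(1, 1, 2, 2),
  score      = c(62, 71, 55, 80)
)

# "Squishing down" district information to the student level via the link:
merge(students, schools, by = "school_id")

# "Raising up" student results to the school level:
aggregate(score ~ school_id, data = students, FUN = mean)
```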

[00:03:22] Lily: And so why does this need kind of something to help with it? Why can’t we have this all in one big data frame? Or why can’t we just have multiple, you know, in Excel there’s different sheets, why does there need to be this kind of linking, as you call it?

[00:03:36] David: If you think about the different sheets in Excel, if you want to use data from one sheet with another sheet, you need to have a way of linking. You can do this through functions, or you can do this in other ways: you can copy and paste, or you can do a lookup, and so on. So there are lots of ways you can do that, but the point is that the more you actually move things from one level to another for a specific analysis, the messier your data gets.

And one of the observations that we had was that in contexts where people do all of this using scripts, it’s actually okay, because your unit of transaction is the script. In contexts where people are wanting to use more visual mechanisms, having data at different levels in different ways is a real problem, because you don’t necessarily know exactly what somebody did to get to a particular analysis. So what’s your unit of exchange? And so reproducible research becomes a real issue.

And so that’s one of the reasons people have moved to really encouraging everyone to learn to code. But we found that everyone learning to code doesn’t work in all environments. So the question is, how do you actually get around this in contexts where the skill set to manage and work with scripts isn’t as high? And actually, one of the big things we found, and this is where I will get to the calculation system eventually, is that one of the real limiting factors on this is related to how you then work with different summaries.

And again, if you’re working within a script, in R, let’s say, then you have a beautiful language for this, especially if you’re using the tidyverse. You can actually have, in a really clearly readable way, exactly what your sequences are to be able to get you to your summary, or to the calculation, the thing you are actually working on.
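
As a hypothetical illustration of that readability, here is a short dplyr pipeline; the data are invented. Note that the final object is just data, with no record of the steps that produced it.

```r
library(dplyr)

students <- data.frame(school_id = c(1, 1, 2, 2),
                       score     = c(62, 71, 55, 80))
schools  <- data.frame(school_id = c(1, 2),
                       district  = c("North", "South"))

# Each step reads as a sentence: join, group, summarise.
summary_by_district <- students |>
  left_join(schools, by = "school_id") |>
  group_by(district) |>
  summarise(mean_score = mean(score))

# The result is just a data frame; the recipe lives only in this script.
summary_by_district
```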

However, if you’re working more visually, with a sort of point-and-click interface, that’s not as obvious. But in many cases, what you actually need is to be sort of authoring that script, those links. It would be nice if you could then produce that nice bit of code to say: this is what it is, this is how it’s been created, so that other people could reproduce it.

We’re not there with our calculation system yet, I would argue, but we are quite a long way along. And it’s pretty powerful, what is there, in the sense that the observation we had is that, instead of the script being the story of where things have come from, we can use metadata related to the data to do that. So in our data frames, we actually store some of these things as metadata, including the links between tables. This can enable us to have the benefits of point and click while keeping the precision that you’d get with a script, and being able to use it in more targeted ways and in different contexts.
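
R-Instat’s actual internal representation isn’t shown here, but base R attributes give a minimal sketch of the general idea of carrying links as metadata on the data frame itself; the attribute and field names are invented.

```r
students <- data.frame(student_id = 1:4,
                       school_id  = c(1, 1, 2, 2),
                       score      = c(62, 71, 55, 80))

# Store the link to another table on the object itself, rather than
# in a separate script. (The "links" attribute name is invented.)
attr(students, "links") <- list(list(to = "schools", by = "school_id"))

# Anyone receiving this object can inspect how it relates to other tables:
attr(students, "links")
```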

And so hopefully, when fleshed out properly, the elements of reproducible research that you can get from scripts, you can now get from these calculation systems, built in ways which, from a user’s perspective, are a lot less intentional. What do I mean by that? If you’re wanting to write a good script which is clear and articulate, and you know how hard that is…

[00:07:16] Lily: Yeah.

[00:07:16] David: If you want to share it, that’s rather different from if you’re just going through, exploring the data and doing things. You do those two things very differently. Whereas the hope is that we could allow people to just do the exploration, but then have everything they need to produce that tidy communication. And so really getting a bit of the best of both worlds is where I hope we can get to.

I feel that I’ve got a bit lost in the directionality for this episode. But I think, you know, maybe what I should come back to is really this identification that, at the heart of why R-Instat has become this much bigger task, is the fact that R-Instat contains some of these innovations in how we can make things like reproducible research more accessible to a wider community who aren’t necessarily as good at coding. And we can, I believe, make it so that it could just become automatic as part of how anyone works.

It’s a long way from that but I think, yeah, maybe I should get more precise or less precise. I’ll let you come in.

[00:08:37] Lily: No, that’s really interesting. But these data frames that you’re describing, or this metadata, show this information about the tidying process and about those manipulations that have occurred to the data. I guess in a script there’s some linearity to it, because you can see: we did this, and we did this, and we did that. But if you have a set of metadata, we still don’t have that linearity, that story of ‘you did this and this’.

[00:09:07] David: You just have what you need, in a sense. This is what’s so powerful about it. If you want to reproduce this, you have everything you need to reproduce it.

[00:09:18] Lily: But then let’s take ‘year’, for example. Sometimes, with certain things in R, you need that to be a factor. And sometimes, well, often, we don’t want to treat it as a factor. But there are some things you do in R where it has to be a factor. I’m thinking of group_by-ing.

And the script would say that you changed that temporarily to be a factor, then to not be a factor. But in the data frame…

[00:09:46] David: If you’re storing this as metadata, you might not have that information, but this is exactly the sort of thing that we could build into the calculation systems. If you have a column such as year, and I think it’s a really good example because it’s something we’ve discussed in other contexts, then, if you are, let’s say, summarising by year, you’re going to group the data by it, and therefore you are treating it as a factor.

[00:10:16] Lily: Yeah

[00:10:16] David: It is naturally an ordered factor, of course, but actually you want to still consider it as being numeric, and maybe even as part of your date. And what if we could think about how to use, let’s say, a column in a certain way, such as treating it as a factor, but have that recorded in the metadata, as part of the calculation which has used it?

And I think that conversion would then be, let’s say, not a permanent conversion, and then of course you’ve got all sorts of issues that come around this. And really, what the calculation system is designed to do is keep track of how calculated columns are calculated.
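
A minimal sketch of that pattern in dplyr: treat year as a factor only inside the grouped summary, so the stored column stays numeric. The data are invented.

```r
library(dplyr)

rain <- data.frame(year     = rep(2001:2003, each = 2),
                   rainfall = c(5, 12, 0, 8, 3, 15))

# Convert to a factor only for the grouping step; the conversion
# is not written back to the stored data.
yearly <- rain |>
  group_by(year = factor(year)) |>
  summarise(total = sum(rainfall))

class(rain$year)   # still "integer": the conversion was temporary
yearly
```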

Think of this like Excel. If you use a formula, then what you see is the result of the formula. But what you’re storing in the cell is the formula itself.

[00:11:20] Lily: Yeah.

[00:11:22] David: Now, we think of the same parallel for the calculation system. Again, if you have a calculated column, what you see is the result of the calculation. But in the metadata, what you’re storing is the calculation itself, which generates it. And what that means is that, just like in Excel, you can refresh, you can redo the calculation, and so on.
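
This isn’t R-Instat’s actual mechanism, but a toy version of the Excel-formula idea is easy to sketch in base R: store the expression alongside the result, and re-evaluate it to refresh. The "calculations" attribute name is invented.

```r
df <- data.frame(rainfall_mm = c(5, 12, 0, 8))

# Store the calculation with its result, as an Excel cell keeps its formula.
calc <- quote(rainfall_mm / 25.4)
df$rainfall_in <- eval(calc, df)
attr(df, "calculations") <- list(rainfall_in = calc)

# Later, after the raw data changes, the column can be refreshed,
# because the recipe travels with the object:
df$rainfall_mm[1] <- 10
df$rainfall_in <- eval(attr(df, "calculations")$rainfall_in, df)
df
```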

This is not novel, this is not particularly original as an idea. It’s relatively common. But what’s interesting is that, as a simple concept, it is fundamentally different to how, for example, the tidyverse would think of the calculations happening. Because the tidyverse is this wonderful piping process, where you are deliberately going from, literally, data frame to data frame.

And the whole pipe, the whole set, the calculation, is the script of what’s happening. But what comes out is just the results. So you have two separate things. You have the script, which is the calculation, and you have the result, which is the result.

[00:12:40] Lily: Yeah.

[00:12:41] David: And these two are detached. And that is absolutely correct and sensible, if you’re thinking from a scripting perspective and a coding perspective. Your script is your script, it’s your language, it tells you how you get that. And once you’ve produced something you just want the outcome.

Whereas in many of the cases where we work, people aren’t wanting to think about the script. And therefore, really, part of what we’re wanting is for the outcome to include the calculation, which is the element that allows it to reproduce itself.

And it’s a different way of thinking about things which, I would argue, you don’t need if you’re scripting, but you do need if you’re using a visual system. And what’s interesting, in another context, is that we’ve actually got certain cases now where these calculation systems are used for, ‘tailored summaries’ is the right word, I would argue. Things like the start of the rains, which you’ve worked on yourself, or the end of the rains, which are needed by quite a wide number of people, but they don’t really need to know the details. More people need to use it than need to know the details of exactly how it was created or conceived as a calculation.

[00:14:09] Lily: Sure.

[00:14:10] David: This has enabled us to get to the point where, in certain cases, it’s more efficient to use these systems, even for experienced R users, than it is to code it up yourself. If you were taking that code, you’d actually need to take a whole chunk of code, because it’s quite complicated, and then you’d need to be careful, and so on. Or it could be embedded as a function, and you just need to learn how to use that function. But inherently, as people get exposed to this using the likes of R-Instat, they are using these very quickly and efficiently, knowing what they need to know to use it.
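
To make that concrete, here is a deliberately simplified, hypothetical stand-in for such a tailored summary; R-Instat’s actual start-of-the-rains definition is considerably more involved, and the threshold and dates here are invented.

```r
# Hypothetical, simplified "start of the rains": the first day on or
# after day 90 of the year with rainfall above 20 mm.
start_of_rains <- function(day, rainfall, earliest = 90, threshold = 20) {
  candidates <- day[day >= earliest & rainfall > threshold]
  if (length(candidates) == 0) NA else min(candidates)
}

set.seed(1)
daily <- data.frame(day      = 1:365,
                    rainfall = round(rexp(365, rate = 0.2), 1))
start_of_rains(daily$day, daily$rainfall)
```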

And I guess what that opens the door to, I believe, is the fact that we could really make reproducible research much more accessible, if we can get systems in place. And I think we’re a long way off. And let me be absolutely clear: my long-term vision for R-Instat is not tied to R any more, because the importance of, and need for, Python, for example, as another language in data analytics is obvious now. And there’s no reason why these same tools that can generate R code couldn’t generate equivalent Python code.

[00:15:37] Lily: Yeah.

[00:15:38] David: And it’s really exciting to see how, just eight or ten years on from when the original ideas evolved, we’ve got to a stage where the concepts behind this have been built, including these concepts related to the calculation system. And I don’t feel I’ve done a great job of explaining this, but maybe I’ll try once more. Let’s use the tidyverse grammar as a way of framing this. Within the tidyverse you have a grammar which is really powerful: it takes data frames, and it does the manipulations on data frames in a way where you can pipe things together, and you get to where you want to go, and you have that as a really nice, readable language, so you can understand what you’ve done. It’s so powerful, what they’ve done there. But what’s gone through that has been a data frame, which at the end is just data, and at the beginning was just data.

My vision in the future would be, within this calculation system, you’d start with data, maybe more than just a data frame, it might be a data frame within a data structure with metadata. And you finish with a data frame within the data structure with metadata.

[00:16:57] Lily: And by these, you mean, in this structure, you mean there could be different linked data frames?

[00:17:02] David: Exactly. There could be different linked data frames, and within that structure there’s also the metadata associated with the data frames. And you take that with you, and what you’ve got at the end is an enriched version of what you had at the beginning. In the current system you start with a data frame and you end with a data frame, or a tibble, or maybe you choose. And now I would envisage that we could actually have the same sort of language, but where the whole process you’ve gone through is now also embedded and included in the end result. And so you’re keeping it as metadata in the end result.

We call this data structure a ‘data book’, in some sense. That’s the whole thing. And I’m not sure that’s the best language, but it’s what we use at the moment. And the whole point of being able to embed these calculations, in this calculation system, and what you’ve done in this approach, into the data book, means that now the reproducibility isn’t through the script. It’s contained within the object which you’re coming out with.
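
A toy sketch of what such a ‘data book’ might contain, assuming invented field names: linked data frames carried together with metadata, including the calculations that reproduce derived results.

```r
data_book <- list(
  data = list(
    students = data.frame(student_id = 1:4,
                          school_id  = c(1, 1, 2, 2),
                          score      = c(62, 71, 55, 80)),
    schools  = data.frame(school_id = c(1, 2),
                          district  = c("North", "South"))
  ),
  links = list(
    list(from = "students", to = "schools", by = "school_id")
  ),
  calculations = list(
    mean_score_by_district = quote(
      aggregate(score ~ district,
                data = merge(students, schools, by = "school_id"),
                FUN = mean)
    )
  )
)

# Reproducing a result from the object alone, with no separate script:
eval(data_book$calculations$mean_score_by_district, data_book$data)
```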

[00:18:09] David: And this fits in with other principles in programming, in other ways and other contexts. And I believe that there are ways in which this sort of system and approach, done right, could be really powerful in terms of opening up how we make reproducible research more accessible, and how we get people working together who don’t necessarily have the same skill levels.

And this ties in, and I guess this is really my final point, with the sort of discussions that I’ve been on the periphery of recently, where people have been discussing how, oh, now ChatGPT can write your code. And to me, this is missing the whole point. Yes, it’s great that people are now able to use ChatGPT and the like, and they don’t need to learn and struggle with the coding as much, but it means that the emphasis was always on the wrong thing. We shouldn’t have been worrying about the code in the first place, because the coding part is not the high-level skill we need to focus on to be able to achieve what we’re trying to achieve.

If ChatGPT can do it, then it was just a question of understanding the language. The problem was never about the language in the first place. And this is what gets highlighted: what is it that we should be thinking about? What is it we should be communicating to one another? How should we be communicating it? These are really big questions, which AI and its use within this sphere is opening up.

And I believe really firmly that if we think about how AI can help us so that we don’t need to become coders, then we’re also thinking more about how we shift from focusing on the language that we’re coding in, be it R, be it Python, to thinking about what we’re actually wanting and trying to do.

That’s what to me we’ve got distracted from. And that’s what I hope we’re trying to do with R-Instat. And I’m not saying this is the answer to this, but I’m saying that a lot of thought has gone into this, which I think is relevant to that question as well. Anyway.

[00:20:47] Lily: No, that’s… Anyway, I won’t jump into that conversation now.

[00:20:54] David: Maybe that’s another episode. We’re running low on time, and I’m quite conscious that you’ve been deeply involved in the calculation system, and I’ve been in the abstract, when you could have been telling us about some of the more practical details. We might need another episode on this where we actually discuss it a little bit more practically.

But I hope this has been useful as why I think it’s so important.

[00:21:20] Lily: Absolutely. No, thank you very much. As you say, we’re out of time, but definitely we’ll continue this discussion.

Thank you.

[00:21:29] David: Thank you.