104 – Misusing Data

The IDEMS Podcast
Description

In this episode, Lily and David discuss the unintended consequences of data misuse, highlighting how outdated survey responses can adversely affect individuals. They explore the balance between human judgement and automated systems, emphasising the need for improved data practices and hybrid approaches.

[00:00:00] Lily: Hello, and welcome to the IDEMS podcast. I’m Lily, a Data Scientist, and I’m here today with David Stern, a founding director of IDEMS. Hi, David.

[00:00:14] David: Hi, Lily. I’m looking forward to today’s discussion. You’ve got something you care about on the agenda.

[00:00:20] Lily: Yes. Yeah. It’s from a story that I’ve come across very recently from a friend of mine. Back in 2016 or 2017, so seven or eight years ago, they registered with a new doctor’s practice and they filled out a survey.

And the survey asked various questions. It was just one of these surveys that people are given. They ask questions such as, how much do you drink, how much do you smoke and all of this? And they decided most people underestimate these things, but I’m going to answer this honestly, because I want to give them good data.

I’m all for that method of thinking, of course, as a data scientist. And that was all fine. And then we come to seven, eight years down the line. And this friend of mine has tried to get life insurance because they’ve recently bought a house, and it was declined. And the reason for it being declined was that seven, eight years ago the doctors deemed them to be drinking too much, and they were flagged up.

Bearing in mind that seven or eight years ago they were in a different kind of position, they were a student, which doesn’t mean that drinking that much is fine, but the social context was very different from what it would be now.

[00:01:40] David: Yeah, the fact that this data is now the only point of reference, and that sort of has long term effects on people. This is one of the huge, huge dangers of the world that we live in, where data is the new oil, but it’s not necessarily good data. Does the data really correspond to reality? And what exactly is the incentive provided in this sort of context? Well, the incentive is that you shouldn’t provide good data.

[00:02:08] Lily: Yes.

[00:02:09] David: So, the whole system is sort of set up in a way where different actors will just use whatever data they have available. And whether or not that data is good, it’s there to make the decisions for them, and this is a really, really serious issue.

And it’s something where, irrespective of the specific context, which of course I can’t speak about, what I can say in general is that the available data is biased in certain ways, so that certain people are disadvantaged because of data they provided for totally other reasons, because things are getting reused. And the fact that this now provides perverse incentives, so that people feel they should lie on the data they provide, is so totally wrong.

This is just poor design of our data systems, and it is prevalent. It’s wrong on so many different levels. That data, in the first place, is taken at face value when it’s analysed, rather than people digging into it or, I suppose what I want to say is, actually quality controlling it, relating it to the truth.

And these surveys are such a classic example of this, that everybody knows that people who fill out the surveys either get bored or, you know, as you say, they underestimate it because they don’t want to present themselves in a certain way to the audience they’re presenting it to. If you’re presenting it to your doctor, you don’t necessarily want your doctor to know quite how much you might be drinking as a student.

[00:03:57] Lily: Yeah.

[00:03:58] David: And so you present yourself in a certain way, but then this data gets used and compared and analysed as if it’s the truth.

[00:04:08] Lily: Yeah, absolutely.

[00:04:10] David: And that’s the bit which I find so frustrating.

My wife’s PhD, actually, I still remember wonderful stories of this, where she had her questionnaire and then, with some of the people who did the questionnaire, she did follow up interviews to dig into certain issues. And I still remember there was this one case where one of the follow up interviews was with a hairdresser. She was studying microenterprise in Kenya, particularly women’s microenterprise.

With one of the people she was having this follow up interview with, she did it while the woman was doing her hair; they had agreed that she’d be a case study, and since she was a hairdresser, she did my wife’s hair while they talked. It was a very nice arrangement, but it also meant that they were in a rather different conversational mode than you might be if you were being more formally interviewed. And so then the more sensitive questions, such as intimate partner violence, you know, there were horrific stories which then came out while she was getting her hair done.

And so she went back to the questionnaire, which had been filled out a few days before by the same lady: have you ever experienced intimate partner violence? No. Just so at odds with what actually came out in the more informal interview.

[00:05:28] Lily: I see, so I thought that was going to go the other way. So it could go one of two ways there, of like, being more informal means you’re more likely to speak about these things, but I was thinking it was going to go the other way, of, oh we’re in public, there’s other people around potentially, if you’re kind of at a hairdressers, and you’ve got other people in the vicinity, then you’re going to feel less inclined to say things.

[00:05:49] David: No, I mean, my understanding of this was that it was partly because they were with other people around and they were discussing this, and there were other people in the hair salon who were then sort of coming in with their stories. It was just a sharing environment, it wasn’t being interviewed, it was just sharing experiences. This was something which then, you know, led to that opening up.

Now, I think what’s of course important is that this is known, this is a story I’ve told because I’ve got personal experience related to it in this sort of way. It wasn’t me doing it, but my wife was. But it’s something which is so common. You know, it is known if you’re collecting sensitive data about these things, it’s known that people sort of bend the truth and tell it in a certain way.

And that question about whether it matters or not for the analysis of the results, it’s a really interesting and difficult one because it depends how you use it. And this is where it comes back to your original story: if these things are used in any form for, you know, life insurance or anything which could affect other people’s decision making about you later, well then the incentives to provide accurate information are not there. You may as well provide the information which is going to help you in the future.

[00:07:18] Lily: Yeah.

[00:07:18] David: And that’s just so wrong on so many levels. If you’re going to do this, you need to have incentives which do sort of incentivise people to share truthfully. But how do you get such incentives? You know, that’s almost certainly not possible.

[00:07:36] Lily: Yes.

[00:07:36] David: If you have a sensitive subject, any form of incentive is going to push you one way or another. One of the interesting things that I’ve found is that in other contexts which are less sensitive, one of the things which we find works quite well is training and using youth in the communities to be the enumerators who actually collect that data for their community, on behalf of their community. That can be very effective.

Especially if there aren’t incentives one way or the other beyond getting accurate data, they can actually know things because they’re in that community, which therefore helps them to get the right information and provide it. This is in farming communities in particular, where this sort of approach has been used, in a number of different contexts that I know of, where training the enumerators in villages to be part of the process of collecting the data with their community has been, and can be, extremely effective.

[00:08:41] Lily: Interesting. Well, so I guess what I was going to add is that, I mean, I feel quite strongly that if you’re the person putting it honestly in the survey as well, then that’s… I mean, my PhD was on missing data, as you know. So this then means that if you’re putting it honestly in a survey, then that means that you’re kind of aware of it. You’re willing to talk about it. You’re showing that kind of level of responsibility.

And so arguably you might then be a better person, you might be the person that you want to be giving kind of insurance to, because you’re not hiding away from the fact that you’ve got these problems and that, but instead you’re willing to talk about it… It seems like you disagree.

[00:09:34] David: I don’t know. I’m confused. This is an interesting one. It’s not that I disagree. It’s that, you know, how do you know who’s answering honestly and who’s therefore willing to …? Actually, let’s separate out these two things. And I do recognise that I’ve taken us off track a bit with some of these other instances where you actually tie the more survey based with a sort of follow up interview.

[00:10:02] Lily: Yes, yeah.

[00:10:03] David: But I guess what you’re sort of saying is that you might have a bias towards the people who answer more honestly anyway, but that’s not what we’ve seen.

[00:10:18] Lily: Yes, I guess what I’m saying is that it could be that firstly, you are now inclined to answer less honestly, because you don’t want to have negative impacts occur later, you know, seven, eight years later, even longer later as well. After my friend and I discussed it, we were really looking it up, and there were people saying that it happened to them 30 years later.

[00:10:45] David: Yeah.

[00:10:46] Lily: 30 years later they were still denied it, 30 years after doing a survey.

[00:10:53] David: Yeah, and it is this element of, you know, the right to some form of anonymity about your past for some of these things, the right to access some of these services. There’s so much, you know, in the data that’s available.

I do want to come back to try and deal with your statement about the people with the sort of follow on interviews, or follow on case studies sort of things. But just this ability to be able to have information from your past not necessarily easily available. This is one of the worries that many people have about young people putting so much information out on social media now.

[00:11:36] Lily: True.

[00:11:37] David: That that is now part of the internet and that will always therefore be there as part of what’s known about you in the future.

[00:11:46] Lily: True. And I know of people that have had jobs declined because of what they’ve posted on the internet when they were a teenager and in a different mindset to where they are now or just in a different stage of life.

[00:12:00] David: Yep.

[00:12:02] Lily: And I suppose with that, you know, if you did something illegal as a teenager, you would be tried as a child, not as an adult. But I suppose I know people that have written things, not anything bad, well, not anything, like, horrific, or done anything horrific, but just a joke in bad taste, say, where they wrote it as a teenager and then it’s come back to get them 10, 15 years later, but they’re now kind of tried as an adult in a way. Well, they should have deleted it, obviously.

[00:12:34] David: But even that right to delete it in different ways, these are, I don’t have the answers to this, but it is something where, when so much is out in public, but not everything is out in public, you see what you see, not what is. Which is a very interesting and difficult process for anyone wanting to take that data and use it for decision making.

You can only use what you have visibility on in decision making to make decisions. So for the insurance company, this is the only information they have. And so they put things in place to make decisions based on this. Well, yes, from a financial standpoint, I can understand why it is cost effective for them to use the data they have. They’re going to eliminate a few people, some of whom maybe they shouldn’t have eliminated, the false positives in a sense, whereas some they should be eliminating they’ll have missed anyway, the false negatives. And they just accept that as part of their risks.

But for an individual who gets caught up in this, who is the false positive, they’re really seriously affected. And for the individuals who lied and therefore led to these false negatives, well, they get away with it and then everyone else’s insurance premiums go up, in a sense because of that, to take into account that risk, because the insurers will never lose out. You know, this is something where the consequences for that individual, you get into game theory now.
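To make the asymmetry David is describing concrete, here is a minimal illustrative sketch in Python. All the rates and costs are made up for illustration, not figures from the episode; the point is only that, averaged over a whole portfolio, both kinds of error become line items an insurer can price into premiums, whereas each false positive is a single person bearing the consequence alone.

```python
# Illustrative sketch only: hypothetical rates and costs, not figures from the episode.
applicants = 100_000
true_high_risk_rate = 0.05     # share of applicants who genuinely carry higher risk
false_positive_rate = 0.02     # low-risk applicants wrongly flagged by stale survey data
false_negative_rate = 0.30     # high-risk applicants missed, e.g. because they under-reported

high_risk = applicants * true_high_risk_rate
low_risk = applicants - high_risk

false_positives = low_risk * false_positive_rate    # people wrongly declined cover
false_negatives = high_risk * false_negative_rate   # risky applicants wrongly accepted

avg_claim_cost = 200_000   # hypothetical average cost of a claim from a missed high-risk case
lost_premium = 5_000       # hypothetical premium lost per wrongly declined applicant

print(f"Wrongly declined applicants (false positives): {false_positives:,.0f}")
print(f"Missed high-risk applicants (false negatives): {false_negatives:,.0f}")
print(f"Cost to insurer of the misses:   £{false_negatives * avg_claim_cost:,.0f}")
print(f"Cost to insurer of the declines: £{false_positives * lost_premium:,.0f}")

# The insurer can spread the cost of the misses across everyone's premiums,
# while each wrongly declined applicant carries the consequence individually.
```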

[00:14:19] Lily: Yes.

[00:14:20] David: The difference between what’s good for a group versus what’s good for an individual.

[00:14:27] Lily: Interesting. Yeah, okay.

[00:14:30] David: But there’s so many areas of our society where the same scenarios pan out, where questions about false positives and false negatives, things being misidentified in one way or another, lead to huge consequences. The same was true during COVID, you know, in the lockdowns, where you had the more extreme versions of the lockdowns, where once you tested positive for COVID, in many cases you were forced into isolation.

[00:15:10] Lily: Yeah.

[00:15:12] David: And I heard this mostly in places like the US, where that policy was also happening at the same time, and where the financial consequences of that for people in vulnerable financial situations could be devastating. And so now your incentives for actually testing correctly and acting correctly on the test are all wrong. This leads to so many really serious issues because no such test, no such data is perfect.

And the scientists who believe in them understand that. And the fact that this has negative consequences for some individuals can’t overrule the positive value you get for limiting, you know, the spread of what is otherwise a deadly disease. Yeah?

[00:16:07] Lily: Yeah. Yeah, I see what you’re saying.

[00:16:11] David: So the equivalent for your case is that you have an individual who is misidentified, or identified incorrectly, in relation to receiving insurance, or being flagged in different ways for this; you know, not having flagging at all would have more negative consequences for society.

But having someone flagged for the wrong reasons, as you sort of potentially described it, this is something where that has really serious negative consequences for those individuals. But there isn’t a way to avoid, once you put these systems in place, there being misidentifications.

[00:16:56] Lily: Well, no, there isn’t a way, there isn’t a foolproof way. You’re always ten steps ahead, but that’s when we come to what you were saying at the start about having these follow up interviews to actually verify, okay, how good is our system? How correct is this method? Does this make sense?

[00:17:14] David: Exactly. How much does the data we have actually correspond to reality?

[00:17:20] Lily: Yeah.

[00:17:21] David: And this is where, you know, mixed methods, having sort of relatively large scale surveys, followed by a small number of in depth interviews is very well established as an approach to trying to get data and having an idea of what that data actually corresponds to. This is very much standard mixed method practice.

[00:17:40] Lily: I guess, we’ve said it before with any kind of modelling approach or, you know, with AI, like what the model returns, so in this case either you have COVID or, okay, we shouldn’t grant this person insurance. We shouldn’t give this person a credit card, you know, on these different case studies we’ve seen in the past…

[00:18:00] David: Yeah.

[00:18:00] Lily: You should still verify, the human should still have the last say.

[00:18:08] David: Absolutely. But again, this is where it’s a really complicated one. So in general, I think that’s correct. And I think the insurance is a good example. Who should have the last say about whether insurance is granted or not? If the human has the last say, then actually it’s favouring the people who can build those human relationships well, who know how to play the system, who know what to do in those human interviews and in those human interactions.

[00:18:42] Lily: True.

[00:18:43] David: If the data, and I’m going to call it that because that’s essentially whatever the algorithm is, it’s all coming out of the data. If the data has the last say, then in some sense, you’re favouring people who populate the data in ways which are good for them. If people aren’t just providing truthful data, if people are providing data which benefits them in certain ways, then they’re the ones who are gaining an unfair advantage.

So the simple truth is, if we actually wanted better systems, we need to be bringing these two skill sets together. So, you know, with the data which is sort of built up towards the decision, being able to have, if you want, human interviewers who are maybe not trying to use the data to make a decision, but trying to use an interview, let’s say, to validate the data as the first step, might be an interesting way. Because if you can identify the people whose data is trustworthy or not, then maybe you can actually look at the data and understand how to make decisions based on that data when it’s trustworthy, or how to make decisions when you know the data isn’t trustworthy.
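As a purely hypothetical sketch of the validation-first flow being described here, the snippet below assumes a made-up record structure, scores and thresholds (none of this comes from the episode or from an IDEMS system): a human validation step grades how trustworthy each record seems, and only validated records are passed to the automated rule, with everything else routed back to a person.

```python
# Hypothetical sketch: field names, scores and thresholds are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SurveyRecord:
    applicant_id: str
    reported_units_per_week: int          # self-reported alcohol consumption
    trust_score: Optional[float] = None   # filled in by the human validation step

def human_validation(record: SurveyRecord, interview_notes: str) -> SurveyRecord:
    """Placeholder for the human step: an interviewer compares the survey answers
    with a short follow-up conversation and records how consistent they seemed."""
    consistent = "consistent" in interview_notes.lower()
    record.trust_score = 0.9 if consistent else 0.3
    return record

def decide(record: SurveyRecord, threshold_units: int = 14, trust_cutoff: float = 0.7) -> str:
    """Let the data-driven rule act only on records a human has judged trustworthy;
    everything else goes back to a human reviewer rather than being auto-declined."""
    if record.trust_score is None or record.trust_score < trust_cutoff:
        return "refer to human reviewer"
    return "flag for review" if record.reported_units_per_week > threshold_units else "accept"

record = human_validation(
    SurveyRecord("A123", reported_units_per_week=20),
    interview_notes="Answers were consistent with the follow-up conversation.",
)
print(decide(record))  # the automated rule applies because a human validated the record
```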

You know, these are sort of things which could be different in different ways. I don’t have the answers to this at the moment, but I do know these are the sort of questions that we as a society should be grappling with. How do we get good data, which is going to help better decision making, which I believe should be human decision making, but enhanced with these AI and other data science approaches?

I don’t think we’re there yet, and I’m very worried by the fact that most groups I know, their motivation is to remove the humans from the loop because that would be more cost effective for many areas of decision making. Whereas if we were thinking about it differently and our main motivation was to make the best possible decisions, then my guess is we’d actually get some sort of hybrid, blended system which would still have humans very much in the loop, but it would be thought of in a very different way. And how the humans then feed back into the models, and how the models relate to the sort of human discussion, I think would be fascinating.

But I think that there’s a lot which could be thought of, or worked towards, which we don’t know how to do yet. And it’s interesting to me that even if you look at the science fiction on this, forget about what’s actually possible now, I don’t see many illustrations of novel hybrid ways in which data could potentially enhance decision making in powerful ways.

I don’t know. I feel that it’s an area where human imagination is probably needed to help us actually think through how we could use the combination of the data that we’re now able to process and which exists in various forms alongside humans in the loop.

[00:22:16] Lily: Yeah, because it then becomes one of these human problems. I guess, at the moment my life is very much fit into two categories. Is it a problem that the computer can solve or a problem that I need to solve?

[00:22:32] David: Whereas the hard problems, we need both.

[00:22:37] Lily: Yeah.

[00:22:38] David: I mean, I think that you’ve summarised this really well. This is something where really if you either just have the human or you just have the machines as you like to put it, you are going to be subjected to a sort of set of biases which are quite substantial in different ways.

[00:22:59] Lily: And those biases are both substantial, but they can also lead to, you know, the bias in the data then causes you to have bad data, because people will say, okay, I won’t be as honest in these surveys.

[00:23:15] David: Exactly.

[00:23:17] Lily: The biases with the humans will then lead you to have, again, bad data. Both of them kind of creating a worse and worse and worse situation.

[00:23:27] David: I think there’s definitely an element, and I’ve felt this myself, that “the truth will win out” is not necessarily what we are observing in the ways in which data is being used in our day and age. And I think that’s very much a consequence of how we, as a society, are building the skills and the tools we need.

And often how we see it is as a black and white decision, you know, something is good or something is bad. And you see this even in just the way I sense society is moving to the extremes. And I think that really does correspond to what you’re describing in some sense: if you have incentives in the system which are maybe well intentioned, but which lead to consequences where you’re incentivising people to provide misinformation, wrong data, that will strongly affect society as a whole.

[00:24:37] Lily: Well I think particularly in this original instance, or even with your COVID example, this isn’t like information in a survey that you’re given on the side of the street when you’re about to go into a shop. It’s information about your health for doctors, it’s data that you want to be correct for your doctors. But then there are now these systems in there that say, okay, well, you shouldn’t be honest with the doctors. I mean, I’m obviously being quite extreme. And then it’s the same with your COVID example of the COVID tests, of who gets affected if we lie on the data, say, or kind of misrepresent it.

[00:25:20] David: But this is where it’s so easy with any given set of procedures or processes and so on to see how individuals can benefit from a little white lie. Oh yeah, I only drink two glasses of wine. Oh you meant a week, I was thinking a day. I just misunderstood the question, or whatever it might be. Yeah?

[00:25:50] Lily: Yeah.

[00:25:51] David: So, you know, getting misleading data, it’s very easy to do, and it’s very easy to justify, unfortunately. And it’s something which will never really come back to haunt you. Whereas, as you say, there are certain areas where actually being truthful about that data can come back to haunt people, and you have these instances that you know of where that has happened.

And I feel that as a society, we are really, really still in our infancy about how we live and how we work with data. The things you’d expect from a mature system, the signs are just not there yet. I mean, we’ve talked about the hype around AI in the past as well, but the hype around how you can use data for all sorts of things is not accompanied by an equivalent scrutiny of whether the data actually contains the information we need for those things. There’s not that same scrutiny, and if the data doesn’t have the information in the format that’s needed, there’s often too much money invested to pull back at that point.

I guess what I’m keen to, well, whether it’s concluding or not, keen to just repeat, is that when we get as a society to the point where we recognise that giving people incentives to have the right information stored about them opens up the possibility for so many other things to just work.

And I just want to finish on this because we’ve got a very concrete example of this in our work with farmers in low resource environments, where at the moment, a lot of data is collected on them. Whereas one of the things that we’re aware of is that with the right tools, you could have data which they collect for themselves, which could then be shared.

And the quality difference of that data, if it works out well, could be transformative. And so actually changing the incentives about what data is in there and why, and who it’s used for and who it’s serving, and whether you should be worried or not about who else might get access, or whether you can trust the people who hold that data.

It’s rather interesting to me that the example you’re giving of data that can’t be trusted in some sense is an example where the data originated from maybe one of the institutions in the UK that people trust the most, your public health service. And that example of that transfer of data almost being a breach of trust in that way is a really interesting one.

[00:29:05] Lily: Absolutely.

[00:29:06] David: Because the lack of insurance being provided because of a health warning somewhere in your health data, this is something where that’s certainly not in the public interest. This is in the private interest of the private insurance providers. That public data not serving the public benefit is a very interesting one. Anyway.

[00:29:36] Lily: Anyway. No, thank you very much. It’s been a really interesting discussion, a bit longer than we planned, I’m sure.

[00:29:44] David: Yes, sorry.

[00:29:46] Lily: I guess just to have the final summary, because I kind of cut you off a little bit before, of where you were saying humans and data: actually, what we want is not a system of just humans or a system of just data, but a system of both.

[00:30:02] David: We know that systems based purely on human decisions have been subject to real biases, which have cut across all sorts of lines in the past and have had problems. And we know that a system with good data can address some of those biases. However, what your example highlights is that a purely data driven system can have other biases inherent to it, and that really, we need to make sure that we’re finding the right way in the middle.

Now, of course, there’s another problem around this: is the problem that the insurer turned down someone because of their health risk? Because actually, surely, in many ways, the people with health risks need insurance as much as anyone else and maybe more. So is the problem with using this data, with how it was used, which I think it was in the case that you described? Or is the problem, well, should this be a private service in this context, if it means that vulnerable people are therefore not able to access it?

Now, I think life insurance, the example that you used, this is fair enough, you need to calculate the risks and so you need some way of setting the premiums and the rest of it. Whereas there are other services, other forms of insurance. If it was a form of health insurance, now, luckily in the UK, you have your NHS, which is providing health cover in general. But if you’re in the US, and this was for health insurance, and you had something which meant you couldn’t access health insurance…

[00:32:13] Lily: Gosh.

[00:32:15] David: Yeah?

[00:32:15] Lily: Yeah.

[00:32:16] David: The consequences of this could be even more serious.

[00:32:19] Lily: Absolutely. A lot more serious. Because I guess, well, you only need life insurance to pay out once. Whereas health insurance you might need to pay out many times, I mean.

[00:32:36] David: Yeah. And if your whole system is based on private healthcare, and the most vulnerable people in society don’t have access to affordable premiums, then they will not be able to access healthcare, and so you have a whole section of the population that is therefore unable to receive basic health services.

[00:32:59] Lily: Yeah.

[00:33:01] David: Anyway, we have gone over time.

[00:33:03] Lily: Absolutely. No, I know we’ve gone over time. It’s been a really interesting discussion. So thank you very much.

[00:33:08] David: Well, thank you for bringing such an interesting topic.

Maybe it’s worth finishing on the fact that I don’t believe the technology is there yet to do this well. I think this is something where I wish there was more effort going into understanding, well, how could you get systems in place which are able to integrate the data which exists with human evaluations of that data, in a way which actually enhances the truthfulness of the data.

[00:33:38] Lily: Yeah, absolutely. Well, thank you very much.

[00:33:42] David: No, thank you.