00:00:00.0 Y: Okay (6 starts sorting cards)
00:01:43.1 Y: Okay, so this is the file homo_sapiens.GFF. So, are you always saying that the first identifier 100287102 is a gene?
00:01:53.2 Participant 6: Yep, yep
00:01:55.1 Y: ...and identifier for the second, the HGNC identifier, and finally name for the last one?
00:02:00.1 Participant 6: Mhmm.
00:02:01.0 Y: Okay, fantastic. So there's two more files and you can reuse the cards if you wish. So this file is flybase_D_melanogaster.gaf (2 starts sorting cards and filing through them)
00:03:48.1 Participant 6: Tricky.
00:04:02.0 Y: Okay.
00:04:36.2 Y: Okay, so we have for the flybase so far we have data set, database for the first blue item FPGA and 0043467, and GO term for the second one, which is a GO term identifier. Okay, thank you. Final file is this one. (2 starts browsing through cards, placing some on the table; first GENE, further actions not visible on vid; removes GENE and places NAME)
00:08:17.0 Participant 6: Think I'm good with that.
00:08:17.1 Y: Okay, could I get you to (???) as there's quite a few on this one just to decide which one matches to which?
00:08:22.1 Participant 6: Yeah. So, basically, running order: there's identifier 9606, gene is the one symbol, there's chromosome information, and then gene name.
00:08:34.0 Y: Okay. And so why did you choose 9606 identifier?
00:08:39.2 Participant 6: I guess the way I see it is, so you have species or, yeah, taxon ID, and whenever it's sort of an ID or anything like that, yeah, I tend to think it's kind of an identifier.
00:09:00.0 Y: Works. Okay. I was just wondering whether you'd noticed it was the human taxon, but clearly yes.
00:09:07.0 Participant 6: Yes. Yes. So I, yeah, I tend to work with mice.
00:09:15.2 Y: Yeah, there's one struggle I've had actually is that no matter what organisms I choose, they will be wrong for some people, and right for other people.
So I've gone through like this mix of fly files and human files, because they're very popular and well known, even if you don't necessarily know the area.
00:09:31.0 Participant 6: And to be fair, this is, I think this is part of how I think about these things. Because it's, to my mind, it doesn't matter. You know, so I don't necessarily look at that (points to blue highlighted text) and think, human, unless it's something I'm particularly working on. But you sort of think, well, I know, I know what it means. Even if I don't know the specific thing. It's kind of like being in the lab, you know, you're working with loads of different species, but they're all just colourful liquids in the tube. So you kind of know what they are, but it's a kind of container.
00:10:04.1 Y: That does make sense. Okay, so that's the end of the first task. And so I will take away this last pile of scrap, oh, this was for the final file, which is homo_sapiens.gene_info. And so the second task also involves the pink cards, but this time what I'm going to ask you to do is to sort them in a way that makes sense to you and try and explain why you are doing so. So I may ask questions if I need to, you can ask questions if you need to, I may not be able to answer them depending on what the questions are, and if you feel like any cards are missing, you can make your own cards on the yellows, just so that I can fish them out and mark them later.
00:10:44.1 Participant 6: Okay. (Starts to place cards on table)
00:11:04.0 Y: (points at a pile) I think accession's two cards.
00:11:05.0 Participant 6: Thank you.
00:11:25.0 Participant 6: There are lots of ways I could do this.
00:11:28.2 Y: There's no wrong way. And also if you feel like you need to, like think about it or sketch or make notes or anything like that. I've got scrap.
00:11:39.0 Participant 6: So you want one single arrangement of these?
00:11:41.2 Y: It could be clusters. It could be piles. Whatever makes sense to you.
00:11:46.1 Participant 6: Oh, I see. Okay. Okay. Yep. (2 is arranging cards on table)
00:13:54.1 Participant 6: That one's tricky. (Holding card in hand; unidentifiable to coder).
00:14:08.2 Y: So you can leave it aside if you don't feel like it belongs with anything else.
00:14:50.1 Participant 6: Okay. I'm done.
00:14:51.2 Y: Brilliant. Okay, can I get you to explain what the piles are and why they have the content they do?
00:14:56.0 Participant 6: Yep. So these are biological in my mind (points at cluster containing the likes of CHROMOSOME, GENE, D. MELANOGASTER, P53, XY; these are the visible ones, but there are more). So, they are examples of actual species or the biological concept or a theme. A gene, a chromosome, a transcript or an organism. These to me are descriptions (points at next pile). So, expression is a construct, it's context dependent. Similarly, molecular weight, the length of something, what pathway it may or may not be part of, these are basically kind of descriptors. Yeah, kind of descriptors, whereas these are actual just, you know, (points at first cluster) biological things, to say that properly. These are identifiers (points at cluster not visible on vid). So, there's a mix of, you know, a single gene can have a name, identifier in a database, a particular symbol, an accession number, these are examples of these. So, there, they have no biological meaning other than an identifying purpose. (Moves on to pile at the upper left corner of vid, cards not readable). These to me are kind of metadata type of thing. So where it was published, an example of a database, it might be in links to databases and sets, authors, IDs. There's a bit of crossover here because these are also identifiers, which is why they're kind of together. But to me, these are kind of biological identifiers. These are kind of metadata, and identifiers. And then these are a little bit of a class of things that are more study based (cluster in upper middle of vid).
So rather than being, you know, an organ or some descriptive factor, it's a condition. It's a kind of space that one actually could have come in here. (Moves BRCA1 to first cluster; the "biological" one).
00:16:58.1 Participant 6: So that's, yeah, so that's, yeah. So that's kind of a, like, applied condition. These are kind of basic biological things, these the identifiers, these are kind of metadata, metadata identifiers. And these are kind of descriptors.
00:17:17.0 Y: Excellent. Okay. So just out of curiosity was BRCA1 in here because you were thinking about cancer?
00:17:27.0 Participant 6: Possibly somewhere. I don't work on it myself, but I know enough people talk about it. It's probably in there a little bit somewhere. Yes, possibly. I think that would be fair to say yes.
00:17:36.2 Y: Right. Yeah. So this is another one of those things. I'm going to try to decide on the organisms to use and try to assign which genes to use because you want ones that are famous enough people would know that. Yeah. Okay. I'm just going to quickly verbally read this out. So in the biological pile, scan this over here, we have transcripts, chromosome, organism, gene, BRCA1, H Sapiens, D. melanogaster, XY, homologue, protein and P53. Then the descriptors. Is that what you called this pile?
00:18:09.0 Participant 6: Yeah, yeah.
00:18:10.1 Y: Expression, molecular weight, length and pathway. And then we have the study related things which are asthma, diabetes and disease. At the top, we have BRCA1_Human, PubMed ID example, GO term, I think a protein Q9H4C3_Human, identifier, symbol, accession, name and a specific GO term example. And then we have the metadata related cluster which is publication, DOI, author, a specific example of a DOI, PubMed ID, data set/database and UniProt. I thought I had PubMed ID in here. I did, okay! Do you mind me asking, why is the PubMed ID here and not here? But because it's an identifier?
00:19:00.1 Participant 6: Yes.
Yes, it's, this is what I'm saying, this is a slightly kind of, I link these things, these identifiers are relevant to the databases that they are in. A GO term is a GO term in a GO term database, you know, and and so, you know, you've got a sort of, yeah. Theoretically, I suppose that could have gone in there, or
00:19:28.1 Y: If it made sense to you there that's what I want to hear. I don't want you to tell me about my mental model.
00:19:36.0 Participant 6: That's probably diagnosing issues with my data analysis.
00:19:45.1 Y: Okay. Do you feel, so, we've definitely thought that there's a relationship between these two because these identifiers, these are metadata, so that's related to them.
00:19:52.2 Participant 6: Yeah.
00:19:53.2 Y: Do you feel there are any other relationships between the clusters?
00:19:56.1 Participant 6: Yes, these two quite strongly (links biological cluster with the descriptor cluster above it in the video) so I mean, again, you know this one you could say, well, a specific molecule has a molecular weight. So you could describe it as a kind of core biological thing. But yeah, I don't know it just felt more like a kind of descriptor rather than a, but I guess that's probably a bit wishy washy.
00:20:23.0 Y: No, that's fine. I like wishy washy. Okay, so I have two more questions. One is do you feel like anything is missing? Like there should be some other cards here?
00:20:47.2 Participant 6: You've got a database (tapping on pile) that's...I guess no. It would be good to know, this is very data specific. But if we're talking about biological context, me, from what I'm doing at the minute, you know, with with these kinds of things, I mean, dates are important, you know, technology that was used to generate the data.
00:21:30.0 Y: Provenance. Technological provenance, excellent! Dates. (Y writes yellow cards)
00:21:35.2 Participant 6: Finally, with my stuff in the middle, it's making a huge difference.
So I think, oh, there's not as many genes in this one, then you look it up and find that it's some [Y: version three], or it's just a really, say, pants assembly from 10 years ago, and you're trying to compare it to one that was produced last year and it's, they, they're not comparable. So that's really helpful to know.
00:21:55.1 Y: Right? So you mean like, for example, just the genome assembly or the tool, or version?
00:22:02.0 Participant 6: Well, in my specific case, yes, I'm talking about, so for instance, the genome assembly, the quality of that assembly, the technology that was used, because obviously, if it was a sort of nanopore or something versus an old school Illumina or something like that, you are going to treat the data differently. So that's a helpful thing to know.
00:22:26.1 Y: Definitely. Where would you put these two in that case? (2 moves yellow card between the two left clusters) So that's another bridger, technological provenance, assembly, between the metadata and the identifiers (2 picks up another yellow card) Yep. Okay, fantastic. Anything else you want to add or ask about this before we stop recording?
00:22:55.2 Participant 6: No, I think it makes sense.
00:22:56.1 Y: Brilliant. Okay, thank you. I'm stopping recording.
***PART 2 starts here***
00:00:01.2 Y: Right, is there any entry point or any particular bit of data that you think is more important or more exciting, more interesting than the other bits?
00:00:10.0 Participant 6: An entry point, that is very question specific from, from my perspective, so it depends what I'm trying to do. I think for me, that's my entry point (points at cluster which is not visible on bottom right corner, probably the "biological pile" which contains "transcripts, chromosome, organism, gene, BRCA1, H Sapiens, D. melanogaster, XY, homologue, protein and P53."). I often find myself looking up particular genes or, or more often than not searching gene lists doing GO term enrichment and that kind of thing.
So I would get very excited with significant GO term enrichment or if particular genes appear in a gene list and they're genes that I'm interested in for whatever reason. So that would kind of be my foot in and then once in that sort of place, very quickly you start looking around the structure of genes, gene copy numbers, etc, etc. And then, embarrassingly enough, this probably comes a little bit later where I'll actually go, then, once, once I've got a hook in, I will then go and have a look at, well, you know, look at that data itself, the technology that was used, the reliability of that data, so I guess it's probably like some 1, 2, 3 (points at 3 clusters).
00:01:22.1 Y: Okay. So you'd probably start with things like identifiers, look up your favourite gene name, for example, like that. And then you might look at more biological, concrete specifics, and then you'd find out about the metadata and the provenance of your data.
00:01:35.1 Participant 6: Yes. And this is just to me, that's the only way that kind of makes sense because everything's complicated. It's a bit of a stupid thing to say, biology is complicated. If you go in here, to my mind, you're never gonna find the specific thing you're looking for. If you've got a sort of hypothesis driven research, you've got a question that you're trying to focus on. Then, by virtue, even if that question has things to do with all of or any of or all of these things, you've got to start with some kind of order, you've got to be able to identify specific parts of that kind of crazy mass and start picking it apart rather than just going into the crazy mass and going, okay, you know what's important or not here. So to me, this is really only kind of the natural way in but from what I'm doing, at least, I realise there's a lot of research out there, but from what I'm doing, that would be kind of the only way that I would go in. Because here I'm lost very quickly.
00:02:31.0 Y: So if I paraphrase, we need to find these as the filters that get you to the relevant bits of data.
00:02:36.1 Participant 6: That's a very more succinct way of putting it. Yes.
00:02:38.0 Y: Okay, good. So it means I've understood the concept. Okay, now, I really am stopping.