[sorting through cards]
Participant 4: It's gene, name for a gene, ooh, different combination, accession, okay, so yeah. Disease... Ahhh (places card). Database
00:00:43.0 Participant 4: Really interesting things to try and... especially because some of them are things that are kind of the link between certain aspects. [Ah] That's a different one. Okay.
0:01:11.5 Y: I think that by pathway and homologue you have double cards that are stuck to each other
0:01:15.3 Participant 4: Yeah, yeah
0:01:15.3 Y: Ah you knew that, okay.
Participant 4: Yeah, I'm just trying to get a sort of overall... thought. Okay. Right. I'll put those off to one side [Dataset, Uniprot, PubMed id, specific id] because they are... (trails off)
0:01:36.7 Um, so the things I'm putting at an angle are actually things that are types of the thing that's that they are under [visual examples of this include GO: TERM, and a specific GO term underneath it, as well as dataset/uniprot and PubMed id + specific id]
02:00 It kind of, um, kind of feels like what I want to do is actually link some of these things in, um, so we actually have a line between some of these as well.
0:02:14.2 Y: yeah, would you like a bit of paper to sketch it - would that help?
Participant 4: Um.... yeah.
0:02:20.8 Y: Okay. I have some of my previous forms that I didn't use that I can use for scrap (sounds in background of fetching scrap paper).
0:02:55.2 This is all scrap, feel free to.... Don't know if these work (hands markers offscreen) but if you want to use them you can.
0:03:09.8 Participant 4: I might, I might be overthinking it, it's just that it feels like... I guess like it depends what I'm trying to achieve. It almost feels like I'm trying to make a sort of database schema, so I've got things here like organism and chromosome, obviously chromosome x and y belong to an organism, in this case homo sapiens.
0:03:33.7 Y: Um, this is great so.... there's definitely no wrong answers here, so don't worry about that.
0:03:43.9 Participant 4: that's cool. Ummm. And like, I've got publication here, and PubMed ID, well, publication HAS an author, and PubMed ID... IS A thing that has a... yeah. Ummm.
0:04:22.6 Participant 4: So here's the way things kind of start to fall out of my mind, is that we've got very high-level stuff that's kind of abstract, things like publication, organism, chromosome, or disease - it's things that you'd kind of um, maybe use as your starting points for actually collating data about what something is, and then you've got things like Gene, Transcript, protein, ah... expression data, I guess, goes on transcript, accession, name - I'm really not sure where to put those, because they're just kind of like, they're such vague concepts. Length again, it kind of goes here somewhere but is it the length of the gene, transcript, protein...?
0:05:23.7 Participant 4: I guess database is probably sort of higher level, this one again, it's hard to place.
0:05:32.3 Participant 4: DOI, I kind of want to put it next to publication, but if I'm being really strict, it's not... so I guess, I'm going to put it next to publication, because lots of different things can have DOIs, maybe secondary analysis. I'm gonna, yeah - more things should have DOIs.
0:06:08.1 Participant 4: I've got some gene names here - specific instances for gene name, um. I'm sure I had BRCA earlier.
0:06:19.8 Y: You did, is it... hmmm. Help, our gene’s gone missing. Is it on the floor? No, no floor genes. We definitely did have BRCA.
0:06:33.4 Participant 4: [Looking under a piece of paper] there we go!
Y: Ahhhh [laughs]
Participant 4: Ummm, so I guess that would kind of lead into that in some way.... [referring to BRCA1 and BRCA1_HUMAN] Pathway... stick that over there.
0:06:56.2 Participant 4: Molecular weight - stick that with length. GO Term....ahhh, stick it between the biological stuff and the classificationey stuff.
0:07:11.2 Participant 4: Ah, diabetes, we can put that with disease over there. Homologue, put that up there, symbol stick it up there because it....
0:07:24. Participant 4: not actually familiar with what P53 is. Um. Are you allowed to give me any hints? is it?
0:07:28.4 Y: ah, yeah, I think it's reasonable to say that it is a protein identifier
0:07:38.7 Participant 4: right, okay, so I think these want to be over here... this... like... [arranges cards]
0:08:24.1 Yeah, okay. I think that's kind of where I'm at, so - those are just things that you'd find in your database, those are things that kind of like - collections of data. Very high-level stuff, or something that's characterising a collection of data, disease, and these are low-level things [gestures to different piles each time]
0:09:00.3 Y: Okay, so you mentioned you felt like there needed to be some links between some of these groups. Can I ask what they are and why you see them?
0:09:09.7 Participant 4: yeah, so I mean I guess from a sort of database point of view, your publication will have an author, a PubMed ID, I guess also organism, so human is going to have certain chromosomes, diseases link back to an organism, you're going to have some sort of interaction I would assume between your gene, your protein, its homologues, your transcripts, expressions. yeah, it definitely feels like what I’m looking at here is a mixture of tables and things that go with tables, and things that go in tables, and some links between tables.
0:09:59.7 Y: Okay, so one final question - if you were to draw links between these entities, where would they be? Just for the ones that aren’t on the paper - I think the links here are pretty clear.
0:10:12.6 Participant 4: so, let's see. Umm, gene is gonna have... I'm going to have to take out the items that are in here. Gene is going to have transcript. your transcript is going to have expression data, and protein data.
Um, homologues - I guess that would usually be linked through genes. Pathway data - probably comes off expression. Um, those things that belong in that table I guess.
0:11:07.3 GO terms. Uhh - I guess they kind of apply to all kinds of aspects of this.
Y: Okay
Participant 4: Yeah, I guess I'd probably start at the gene level and filter down from there. That’s probably because I've spent time working on ENSEMBL and that kinda - that's
0:11:31.7 Y: Okay, yep, that is meaningful. Do you feel like there was anything in this dataset that should be added?
0:12:03.2 Participant 4: I can't think of anything immediately.
Y: Okay
Participant 4: Nothing's jumping out.
0:12:16.4 Participant 4: Possibly variant data, but I'm not sure...
0:12:24.6 Y: [writing "variant data" on a card offscreen] I'll put a question mark beside that, variant data. Where would you stick that in the model if you did?
0:12:33.5 Participant 4: That would probably hang off somewhere around there, with the gene. [places card between gene and transcript] - I think something like that
0:12:44.5 Y: Okay! right, that was really fantastic, really useful - every single time I do this it's incredibly interesting. What I'm going to do - I'm going to take some snapshots with my camera, if it reaches properly. So, first of all - can I take a photo while I'm recording? No...
Participant 4: I guess if you just record it, you can always pause that.
00:13:25.1 Y: So we have dataset, database, with uniprot beside it, next to this we have organism which has H sapiens and D melanogaster directly attached. Chromosome xy is hanging off organism, as is disease with diabetes and asthma as examples, we also have nearby DOI with an example DOI. Publication, which links to author which is linked to PubMed id and the example of PubMed id nearby, we have name, accession, symbol, identifier - those are the 4 that were a bit too generic to stick anywhere - can you, um, correct me if I'm wrong at any point?
0:13:55.1 Participant 4: Yeah, nah that's definitely how I felt.
0:13:56.5 Y: um, we have a pile of examples of the various terms I'm about to discuss at the moment, which are molecular weight and length. Oh no, sorry, those were also the generic ones.
0:14:07.0 Participant 4: Yeah, they were kind of slightly off to one side because molecular weight and length are slightly more specific to some of these things (gene, protein), whereas name, symbol, accession, identifier, you could find them, I mean you could apply these to all these datasets.
0:14:24.9 Y: Fair enough. Okay, so we have gene, and basically everything hangs off genes so we have GO term, we have homologue, we have transcript, variant data, protein, expression, pathway, which sort of form a line - actually, is it transcript, expression, pathway?
0:14:42.4 Participant 4: I think I’ve got it down as transcript, expression, pathway. That’s not a hill I'd be willing to die on.
0:14:50.1 Y: They're a friendly cluster
Participant 4: Yeah
Y: Okay, that's fine.
0:14:53.0 Participant 4: Umm, the oh, actually, thinking of things that are missing - I don't see any chromosome position.
Y: Ahhhh. I suppose I should write it down [presents card with chromosome position written]
0:15:16.4 Participant 4: yeah, so actually, probably sit up at the top. Um, like, most things, you'd start with the position of a chromosome, and that would define the position of your genes and transcripts.
Y: mmhmm, okay.
Participant 4: Yeah.
Y: would that change the way any of those are linked in?
0:15:37.3 Participant 4: Umm, I mean, it would probably, you'd probably see the position of the chromosome would be linked to the chromosome. Um, but probably those still kind of sit to a different side. Maybe that would sit over here kind of place - organism would sit above that.
[the organisation is now organism - chromosome - chromosome position - gene - transcript]
0:16:01.6 Y: Right, so chromosome could be the link out to organism, and to - that makes sense.
0:16:06.3 Participant 4: I guess partly the way I'm looking at it is, I picked those [points to cluster on paper sheet] - I'd probably be storing those in a slightly different way to how I'd be storing those [gene cluster] - you know, your chromosomes are human and you've only got a smallish number, so even if you included [unclear], you're still looking at hundreds, whereas genes, we've got so much data, we'd store it differently.
0:16:31.2 Y: okay, yup, that does make sense - you have a finite number of organisms, finite number of publications, finite number of chromosomes, whereas, when you get to the nitty gritty - there's _so much data_.
0:16:47.9 Participant 4: Yep
0:16:49.6 Y: Okay, right, I think I am gonna call that task done. There is one other very short task, and once we’re done with that I need to get a short background survey, so I can figure out how to categorise you - um - and then we can go to questions. Right, so I have three files, or rather I have the start of three files. So, I'll give them to you one by one - and all I want
0:17:13.4 Participant 4: Shall I....put these away?
0:17:15.4 Y: Yep, it's fine to clear them away. Ummm, so, there are some items that are highlighted in blue. I want to know whether those map, in your opinion, to any of the cards that you have.
0:17:27.7 Participant 4: oh, okay, cool
0:17:28.7 Y: so, there’s three different files. So, the first file we're looking at is homo sapiens.gff.
0:17:36.8 Participant 4: Okay, cool. Um. so, I don't really know if I've worked with a GFF file before. Ummm. DB cross reference - so we're looking at gene IDs, so.... [hunts through cards] where did that gene go? HGNC? so that's gene nomenclature. Umm, again, I guess that's ... so we'd be looking at gene accession.... gene....
these are both accessions from different databases and the name - that's an identifier. So, I'd call that an accession, an accession from a different database, and then an identifier. or a name. is what I would call those. Is that?
0:18:31.5 Y: Yeah, nah, that is perfect, that is beautiful. That's the only thing in that file. So the next file we're looking at is - flybase_d_melangoaster.gaf.
0:18:45.5 Participant 4: cool. so...
0:18:53.5 Participant 4: there you've got a GO term here. [sound of sorting cards]
0:19:08.9 Participant 4: the second one looks again like it's some sort of. Oh, sorry, the first one, um... looks like some sort of accession. That would be presumably flybase. “GN” - ahhh - could be gene identifier? but it's not very helpful. looking at the ah, header in the file. Um, yeah, and we could, um....
0:19:37.3 Y: Okay, cool, fantastic. Final file is... ah, homosapiens.gene_info
0:19:56.0 Participant 4: Yeah, so the first one here is a tax ID, so.... do I actually have to find the card?
Y: It's alright
Participant 4: Organism
Y: Yeah, you know we have organism
0:20:05.6 Participant 4: Yeah. Organism, one is gene identifier... umm, unusual to see an identifier starting with 1 like that.
Y: [laughs] "The very first gene"
0:20:22.6 Participant 4: usually I'd expect it to be 0-padded or something, at least. Ummm, yeah. Then we've got gene symbol, ah, then the last one is ummm... pipe-separated, so you've got a MIM identifier. That's, ummm... Mendelian inheritance of disease. HGNC, so again, that's a gene identifier, Ensembl - that one is an ENSG, so that one's also a gene identifier. Yeah.
0:21:08.7 Y: Cool, awesome, thank you very much. You passed - no - [laughs] oh, I can stop now, so we're done with the recording. That was very, very [track ends].