Videos | 4273pi – Bioinformatics Education Project

Watch videos that will give you an overview of the 4273pi project, introduce the NCBI database, and go through our bioinformatics workshops.

Introduction to the 4273pi Project

This video provides an outline of the project as of July 2023. It was recorded to accompany a poster presented at the Intelligent Systems for Molecular Biology (ISMB) 2023 conference.

4273pi poster for the ISMB 2023 conference (3.12 MB / PDF)

This 7-minute video provides an outline of the 4273pi project.

Hi there.

My name is Richard Fitzpatrick. I am a postdoc in bioinformatics education at the University of Edinburgh. I'm delighted to present my poster introducing Scottish school students to bioinformatics through the 4273pi project. I don't have time in this video to cover everything, so if you want to find out more, please do catch me during the conference virtually or contact me using the details that you can see below. Thank you.

The project is not just me. I'm currently the workshop leader for the 4273pi project, but the team is much broader than that. Daniel Barker was the one who helped originate this way back in about 2016 as a public engagement project in Scotland. You can see we still have members in St. Andrews where they started. We are mainly in Edinburgh, but we also have people in Glasgow. If people want to find out more about taking part in this project, please do get in touch with me. We'd be delighted to have you.

Our aims are broadly to reach all Scottish secondary schools and deliver workshops or teacher training to help these schools understand bioinformatics a little bit more than they currently perhaps do. There's currently bioinformatics on the Scottish school curriculum for Higher Biology and Human Biology, which is aimed at about 16 to 18-year-olds. But we find that teachers often either don't have the background in bioinformatics or the confidence in explaining bioinformatics to school students in a way which makes sense.

What we do is we either deliver in-person student workshops around bioinformatics at two different levels, which I'll talk about in a minute, or we deliver teacher training to help teachers to deliver the materials that we've already made for those workshops, and to help them embed parts of that into their own teaching practice. We provide them with the lesson plans, our teaching materials, our worksheets, and they can use that to fit into their own teaching wherever they may be.

In Scotland, we've reached 50% of Scottish secondary schools with this approach so far. We hope to reach 75% over the coming few years. We're really looking forward to seeing how far we can take this project.

Talking about reach, you can see that we've reached all the way up to the Shetland Isles and all the way down to the borders of England. Our reach is already quite wide. The central belt, which you can see in the box, is the most densely populated area of Scotland. You can see that we've got a good reach there, but we still have a lot to do.

The reason why we perhaps haven't always visited places in those areas is that something called the Scottish Index of Multiple Deprivation factors into how we deliver workshops. We put calls out online on a mailing list called SYNAPSE, and it's usually quite popular. People are really interested in taking our workshops on. For these workshops especially, we have a lot of repeat interest as well.

We're trying to focus on schools which are low on this Scottish Index of Multiple Deprivation (SIMD), which is quite complicated. But it helps us measure areas where the level of opportunity for various things is perhaps lower than elsewhere. It tends to mean targeting areas and reaching pupils who perhaps don't always get the level of university interaction, for example, that others at the higher levels would.

We have three workshops, two which we currently do quite a lot of, and one which we're just finishing developing and we're hoping to take into schools in the coming autumn.

Food Detective is aimed at younger students. This is aimed at the National 5 level, which is about 15 years old. At this level, they know what DNA is and they know something about the central dogma, but not massive amounts about DNA, and especially nothing really about bioinformatics. We generally introduce this with a narrative where we have a handmade pork sausage which we sequenced. It was a real sausage, many, many years ago, and there's a series of DNA barcodes that we ask them to run BLAST searches on and see where they've come from, and then come up with hypotheses about how these particular barcodes for these particular animals came to be found in this pork sausage. And we have a lot of fun with that. It uses freely available resources and we can do it on any device. This makes it quite portable and quite easy for teachers to also embed this in their own practice if they wish.

The second workshop is more tightly linked to the curriculum, and so is perhaps the one most in demand. This is aimed at the Higher level, which is 16 to 18-year-olds. As I've said, it focuses on something called the Gulo gene, which is involved in vitamin C synthesis. We provide a case study where we ask them to go and identify this unknown protein. Do a BLAST search, find out what it is, find out what organism it comes from, and then see whether humans have something similar. We do a DNA sequence alignment using the BLAST website, and they find that there is a pseudogene as a result of a series of substitution, insertion and deletion mutations. We can talk to them about that knowledge, which they already have as part of their degree. We talk about frameshifts. The human gene has a series of frameshift mutations. We can talk to them about what the implications of those have been for the actual protein that possibly is made.

And then there's the evolutionary chatter that we can have about that. Why is it that humans didn't die out if we still need vitamin C? We talk about things like diet and other organisms as well if we have time. And then what we do is we replicate that using the Raspberry Pi computer. These are little mini portable computers we take into the schools. It exposes the students to a Linux environment and a command-line environment. We can talk to them about how bioinformaticians tend to use that because of how we are using data and processing data and the size of our datasets. We tend to have volunteers come along who are bioinformaticians at various different levels, all the way from master's students and PhD students up to heads of bioinformatics departments. The students get the chance to talk to a lot of people with very different backgrounds as well coming into bioinformatics, which people find quite interesting.

We have a brand new workshop that we're hoping to deliver as well, which is based on PCR, which is something which is also on the curriculum. We're focusing on primer design, which is the most bioinformatics-focused element of it, but I'm trying to introduce a lot of student choice in this. There are going to be six different case studies, which are Scottish-specific case studies around various animals. Students are going to be able to pick how the workshop is going to go. They'll be able to choose the case study they want to look at, and each will have slightly different problem-solving activities. And then we'll come together as a group and focus on one in more detail. Each workshop is going to be different, with a lot of student choice.

And we're doing that because we're finding that we need to change and evolve with students as time goes on. This is something which is becoming quite important as different skills are being developed in students that we are not necessarily capitalizing on.

I'm very much interested in game-based learning and using games in teaching. We've got in development some areas to help visualize some of that sequence alignment data I talked about in the second workshop using Minecraft.

Not got any more time than that. Thank you very much for listening. If you want to catch me at the conference, please do, and thank you for coming along to the poster.

Introduction to the NCBI database and keyword searches

This video gives an introduction to the NCBI database and explains how to do a keyword search for a gene or protein of interest.

This 5-minute video gives an introduction to the NCBI database.

Welcome to Bioinformatics for Biologists, I’m Stevie Bain, a researcher from the University of Edinburgh.

In this video we are going to introduce the NCBI database, that’s the National Center for Biotechnology Information, and explain how to do a keyword search for a gene or protein of interest. You can access the website at the address below: ncbi.nlm.nih.gov

The database has a search bar that allows the user to search using keywords, similar to the way that you would use a web search engine. At the left-hand side there is a drop-down menu that lets you choose which specific database you would like to search, for example nucleotide or protein. Alternatively, you can search across all databases. Let’s run through an example. Imagine we are interested in the enzyme catalase: we type ‘catalase’ into the search box and hit search.

After a few moments, we should see a results page that looks like this. As you can see we have our search results for catalase categorized by database. We have some results in literature which include books and scientific journals. We have results in genes, we have some results in proteins that can be divided into conserved domains and clusters. We have some results in genomes and some results in genetics. We also have results in chemicals. The blue boxes next to each database tell us how many results we have in each. If we look at proteins, we can see that we have just under 400,000 search results.

If we click on protein, we are taken to a page that shows us the search results for catalase in the protein database. As we can see, there are just under 400,000 results which is around 20,000 pages. Our first result is catalase from Drosophila melanogaster. If we look underneath, we can also see it has 506 amino acids. We can click on this description to find out more.

When we click on this description, we are taken to a page that gives us some more information about our protein search result. We can see the NCBI reference sequence, which contains the accession number, the unique ID for this sequence. We can also find out more information, such as the source of the protein, which we know is Drosophila melanogaster, but we can also find out the common name and a bit more about the organism’s taxonomy.

Here we can also find scientific literature related to the result. If we scroll back to the top, we can see FASTA. If we click on this, it takes us to the FASTA sequence of the protein. This first line is the defline - it contains information about the protein name and the species it comes from. Underneath we have the amino acid sequence. Each amino acid is represented by one letter, for example M for methionine.
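The two-part layout described here (a defline, then the sequence) can be sketched in a few lines of Python. This is a minimal illustration and not part of the video: the function name parse_fasta and the example record are hypothetical, and the sequence shown is made up, not the real catalase entry.

```python
def parse_fasta(text):
    """Split FASTA text into (defline, sequence) pairs."""
    records = []
    defline, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new defline closes the previous record, if there was one.
            if defline is not None:
                records.append((defline, "".join(chunks)))
            defline, chunks = line[1:], []
        elif defline is not None:
            # Sequence lines are joined into one string per record.
            chunks.append(line)
    if defline is not None:
        records.append((defline, "".join(chunks)))
    return records


# Hypothetical record for illustration only (not a real database entry).
example = """>example_1 hypothetical protein [Example organism]
MADSRDPASD
QMQHWKEQRA
"""

records = parse_fasta(example)
for defline, sequence in records:
    print("Defline:", defline)
    print("Length:", len(sequence), "amino acids; first residue:", sequence[0])
```

The defline starts with a greater-than sign in the file itself; the sketch strips it so only the description remains.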

We can also conduct more specific searches. Say, for example, we wanted to search for the human catalase protein. We would select protein from the dropdown menu and type catalase in the search bar, but this time we would add the Boolean operator AND in capitals, followed by the species name, in this case Homo sapiens. We would then type ORGN in square brackets, giving the search term: catalase AND Homo sapiens [ORGN].
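For anyone who wants to build the same kind of search term in a script, here is a minimal Python sketch. The video only uses the website; the helper names build_term and esearch_url are hypothetical, and the URL targets NCBI's E-utilities ESearch endpoint, which is an assumption about how one might run the search programmatically. The code only builds the address and does not contact NCBI.

```python
from urllib.parse import urlencode


def build_term(keyword, organism=None):
    """Combine a keyword with an optional organism filter using AND and [ORGN]."""
    if organism:
        return f"{keyword} AND {organism}[ORGN]"
    return keyword


def esearch_url(database, term):
    """Build (but do not fetch) an E-utilities ESearch address for the term."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": database, "term": term})


term = build_term("catalase", "Homo sapiens")
url = esearch_url("protein", term)
print(term)
print(url)
```

Keeping the term builder separate from the URL builder means the same term can be pasted straight into the website search bar.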

This time our results page is specifically showing matches in the protein database for catalase in Homo sapiens. As you can see, we have far fewer results - only 91 compared to around 400,000 in the last search. Each of these results has the name catalase followed by Homo sapiens in square brackets. Our first result has 527 amino acids.

When we click on the description of our first result, we are taken once again to a page that gives us some more information about this protein. This includes the accession number and some more information about the source including taxonomical information. Once again, we also find scientific literature related to this result. Let’s click on FASTA and take a look at the amino acid sequence.

Here we have the amino acid sequence of catalase in FASTA format with the defline giving a description at the top and the amino acid sequence underneath.

We hope you found this overview of the NCBI database and instructional video on how to do keyword searches useful. If you would like some more information about our project, please visit our website at 4273pi.org. You can also follow us on Twitter @4273pi.

Using the NCBI database to visualise the 3D structures of proteins

This video explains how to visualise the 3D structures of proteins using the NCBI database.

This 7-minute video explains how to visualise the 3D structures of proteins using the NCBI database.

Welcome to Bioinformatics for Biologists. I’m Stevie Bain, a researcher from the University of Edinburgh.

In this video, we are going to use the NCBI (National Center for Biotechnology Information) to visualize 3-dimensional protein structures. Firstly, we need to navigate to the NCBI homepage by typing ncbi.nlm.nih.gov into the address bar of your web browser.

The NCBI homepage has a search bar that allows us to search the databases using a keyword search. In this video, we are going to focus on the 3-dimensional protein structure database. So, we click on the dropdown menu to the left of the search bar and select ‘Structure’. Now we type the name of the protein whose structure we want to search for in the search bar. In this video, we’ll type deoxyhaemoglobin AND Homo sapiens [ORGN].

The addition of [ORGN] after Homo sapiens means that we only want to search for deoxyhaemoglobin found in the species, Homo sapiens. It basically allows us to perform a more specific search of the database.

We then click search, and in a few moments, we should see our results page. We can see that we have 106 results for deoxyhaemoglobin in Homo sapiens in this database. For each of these results, the line in blue is the description, which tells us the full name of the structure. And importantly, if we look underneath this we can see “taxonomy”, which tells us what species this structure comes from.

When using this database, it is important to take time to look through the descriptions of results to ensure that you find the most suitable structure. You can always refine your search terms and search again if necessary. If we click on the description line of our chosen result, it will take us to a page with more details. We can see the name of the structure and also some information about the scientific literature associated with this database entry.

When we scroll down, we can see an image of the structure and also the molecular components. In this example, we have 4 protein chains: 2 that are Hemoglobin S alpha chains and 2 that are Hemoglobin S beta chains. Underneath the chain name, we have the gene symbol for each. If we click on this, we are taken to a page that gives us some info about the gene.

For example, we can see the HBA1 gene is located on chromosome 16.

On this page, we also have some more information about the chemicals and molecules that bind to these proteins. Here we can see that we have 4 molecules that contain iron - these are the haem groups. Once again, we can click on this description to get more information about this structure.

If we want to look at this deoxyhaemoglobin protein structure in more detail, we can scroll back up and click full-featured 3D viewer. This may take a while to load but it’s worth having patience as this feature will allow us to interact with a 3D view of our structure. Here we can hover the mouse over the structure to see which amino acids are in the protein.

If we click down on the structure, we can also move it to focus on different parts. In this viewer, we can also very easily visualize secondary structures, for example, alpha helices are shown as curled ribbons. On the right-hand side, we can once again see each of the protein chains in this structure and the molecules.

On the right-hand side, each of the protein chains and their associated conserved domains are represented by colored boxes. If we wish to highlight any of these regions in the 3D viewer, we simply click on the colored box and a yellow highlight will appear around that area. If we wish to remove this highlight, we simply go to the toggle at the top that says ‘selection’ and click.

As you can see this selection has now been removed so we can click on a different protein chain.

At the top of the viewer, there are a number of menus. One particularly useful menu is the ‘color’ menu. This allows us to color the 3D structure based on particular properties or features. Right now, ‘chain’ is selected. That means that each of the protein chains in the structure is uniquely colored. Let’s change this to charge.

Now we can very clearly see areas of the structure that have different charges: negatively charged areas are red, positively charged areas are blue and those that are grey are neutral. Once again, we can use the boxes at the side to make specific selections.
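The colouring rule described here can be imitated with a small lookup. This is a rough sketch under stated assumptions, not the viewer's own code: it assumes aspartate (D) and glutamate (E) count as negative and lysine (K) and arginine (R) as positive, with everything else grey. The viewer may treat some residues, such as histidine, differently. The name charge_color and the example sequence are hypothetical.

```python
# Assumed residue groups, using one-letter amino acid codes.
NEGATIVE = set("DE")  # aspartate, glutamate
POSITIVE = set("KR")  # lysine, arginine


def charge_color(residue):
    """Return the colour this sketch assigns to one amino acid letter."""
    residue = residue.upper()
    if residue in NEGATIVE:
        return "red"
    if residue in POSITIVE:
        return "blue"
    return "grey"


# Hypothetical five-residue sequence for illustration.
colors = [charge_color(residue) for residue in "MKDEG"]
print(colors)
```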

We can also change the style of the structure. If we go to proteins, you will see that there are many options that allow us to change how the structure is styled.

We can change this to lines, for example, or show the proteins as spheres. Let’s go back to proteins and change their style to ribbon. Then we can move along to the side chains option - which you will see right now is hidden - and change this to show the side chains.

There are many options in this viewer, and we recommend that you have a go at exploring these yourself.

We hope you found this video useful. If you would like some more information about our project and our free resources, visit 4273pi.org or follow @4273pi.

Bioinformatics: Food Detective - National 4/5 level Biology workshop

This is a tutorial video for Bioinformatics: Food Detective, a bioinformatics workshop designed for secondary school biology pupils.

In this 9-minute video, we run through our Bioinformatics: Food Detective workshop.

Hello and welcome to Bioinformatics for Biologists. I’m Stevie Bain, a researcher from the University of Edinburgh. In this video, we are going to run through how to do the BLAST search required for our workshop, Bioinformatics: Food Detective.

It will be useful to have the accompanying handouts available as we progress through this short video. You can find all the web pages you need to access the BLAST search tool and the DNA sequences required in the handout.

In this workshop, we use the NCBI database, the NCBI BLAST tool and DNA barcodes to identify species. We also interpret BLAST e-values and the reliability of our search results.

For this activity, we DNA sequenced a sausage described as 100% pork to produce a number of DNA barcode sequences. DNA barcodes are regions of DNA that are common to all animals but vary between species. What we aim to do here is identify which species we find in this pork sausage based on these DNA sequences. We are also interested in analyzing the reliability of our results.

This is a subset of the DNA barcodes we use in this workshop. These can be found at 4273pi.org/schools.

Follow the instructions on the handout to access the BLAST homepage. Here you will see that there are a number of different BLAST search tools. These compare nucleotide or protein sequences to sequences in the database and calculate the statistical significance. For this activity, we need the nucleotide BLAST tool. This will search for our nucleotide barcodes in the NCBI nucleotide database.

When we click nucleotide BLAST, we are taken to a page that looks like this. Now we have to access our DNA barcode sequences and paste them into this large box at the top.

As mentioned previously, we access these sequences at 4273pi.org/schools. You will find the sequences under the National 4/5 Biology workshop. When we click here you will see a page that looks like this. This has all the sequences we need for both tasks in this workshop. We simply highlight the first sequence and copy.

Now we need to paste our sequence into the box at the top and scroll down to where it says program selection. Here we want to choose the option blastn as we are running a nucleotide blast. This ensures a more sensitive search of the database. We then click BLAST and our search will begin to run. This may take a few moments.

Our results page looks like this. At the top, we have some information about the BLAST search we have just run. If we scroll down, we see a table titled sequences producing significant alignments. Here, all of our BLAST search results are listed. Under description, we find the names of the sequences in the database that best match sequence A. Importantly, this includes the species name. One thing to note is that BLAST uses the scientific or Latin names of species, not the common names.

The table is ordered so that our best BLAST result is in the first row. This order is specified by e-value – a statistic that describes the number of hits one can expect to see by chance when searching a database of a particular size. We will discuss this more in a moment. Although we don’t directly ask for them in the worksheet, two other values, query coverage and percentage identity, are also important to consider when doing a BLAST search.

Query coverage is the percentage of our sequence – sequence A – that aligns to the sequence in the database. Percent Identity relates only to aligned regions and describes how similar the query sequence is to the sequence from the database i.e. how many bases in each sequence are identical?
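To make these two definitions concrete, here is a minimal Python sketch that computes both values for a single aligned region. It assumes the alignment is given as two equal-length strings with a hyphen for each gap, and that percent identity is counted over all alignment columns. The function name alignment_stats and the short sequences are hypothetical, not taken from the workshop data.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return (percent identity, query coverage) for one aligned region."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must be the same length")
    columns = len(aligned_query)
    # Identical bases: same letter in both rows, ignoring gap symbols.
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    # Query bases that take part in the alignment (gaps are not bases).
    aligned_query_bases = sum(1 for q in aligned_query if q != "-")
    percent_identity = 100 * identical / columns
    query_coverage = 100 * aligned_query_bases / query_length
    return percent_identity, query_coverage


# Hypothetical example: a 12-base query of which 9 bases were aligned.
identity, coverage = alignment_stats("ACGTAC-GTA", "ACGTTCAGTA", query_length=12)
print(f"Percent identity: {identity:.1f}%")
print(f"Query coverage: {coverage:.1f}%")
```

A result can have high identity but low coverage if only a small part of the query matched, which is why both values are worth checking.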

For this activity, we are most interested in the e-value as it allows us to determine how reliable our search results are. As we progress through the workshop, we compare the e-values we find in Task 1 to those we find in Task 2 in order to identify which set of results is most reliable. The e-value is how many times we would expect to see a match of this quality, between our sequence and the sequence in the database, by chance.

If our e-value is high, we consider the match to be unreliable. If the e-value is low, we consider it reliable. If our e-value is 0, the lowest an e-value can be, that means the match is extremely reliable. There are no definite cut-offs for e-value, therefore in this activity we simply compare the e-values in one table to those in another.

The BLAST output provides e-values in a way that you may not be familiar with, for example, 5.2e-15. This is simply a way of representing numbers with lots of digits without taking up too much space.
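As a quick illustration of this notation, the Python sketch below writes an e-value out in full. The helper name expand is hypothetical. The part after the "e" is the power of ten, so 5.2e-15 means 5.2 multiplied by 10 to the power of minus 15.

```python
import math
from decimal import Decimal


def expand(e_text):
    """Write an e-notation string out as an ordinary decimal number."""
    return format(Decimal(e_text), "f")


# The example from the BLAST output, written out in full.
print(expand("5.2e-15"))

# Python reads the notation directly, so these two values agree.
print(math.isclose(float("5.2e-15"), 5.2 * 10.0 ** -15))
```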

You might like to pause this video and work through Task One of the worksheet now.

Let’s take a look at our first table of results: sequences A-H. Please note that the results you obtain may not be identical to ours, due to continual growth of the DNA database. Since these sausages are 100% pork, we may expect to only find pig DNA in our sample. However, this is not the case, as we find chicken, sheep, cattle and, perhaps more surprisingly, human DNA.

If we take a look at the e-values, we can see that they are very low numbers. So, for example, in the top row 4e-49 is the same as 4 times 10 to the power of minus 49. These low numbers suggest that the results in this table are reliable.

Indeed, there are good explanations for how DNA from animals other than pig may have found its way into our samples.

The questions posed throughout this workshop should prompt discussion about food fraud and DNA contamination, but also the DNA extraction and sequencing processes. The other animal DNA most likely came from DNA cross-contamination from the butcher shop. The human DNA most likely came from the sausage making process or the lab.

Even if benches and tools are thoroughly cleaned, traces of DNA can still linger behind and these will be picked up when the DNA is sequenced. So, although the presence of human DNA may be alarming at first, it is most likely due to human interaction with the sausage or the DNA sample. These are examples of DNA cross-contamination, not food fraud.

You may now like to pause the video here to complete Task 2.

In the second task, we find a more unexpected set of results. For example, frog, grapevine and bacteria. However, the e-values in this table are rather high. When we compare this table to the first table, we can reinforce the concept of using BLAST e-values as an indicator of result reliability. We said before that the lower the e-value the more reliable the match. We find that the results in Task 1 have lower e-values than those in Task 2 by many orders of magnitude. Hence, the results in Task 1 are more reliable. This makes sense when we look at the species we find in each table.
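The phrase "many orders of magnitude" can be turned into a number with a base-10 logarithm. The sketch below is illustrative only: the function name orders_of_magnitude_apart is hypothetical, the value 4e-49 is the Task 1 e-value quoted earlier in this video, and 1.5 is a made-up stand-in for a high Task 2 e-value. An e-value of exactly 0 has no logarithm, so the sketch rejects it.

```python
import math


def orders_of_magnitude_apart(low_e_value, high_e_value):
    """Return how many powers of ten separate two positive e-values."""
    if low_e_value <= 0 or high_e_value <= 0:
        raise ValueError("e-values must be greater than 0 for this comparison")
    return math.log10(high_e_value) - math.log10(low_e_value)


task_1_e_value = 4e-49  # quoted in the video for the top Task 1 result
task_2_e_value = 1.5    # hypothetical high e-value, for illustration only

gap = orders_of_magnitude_apart(task_1_e_value, task_2_e_value)
print(f"The Task 2 value is about {gap:.0f} orders of magnitude higher.")
```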

We hope you found this video on Bioinformatics: Food Detective useful. For more information and more free resources, visit our website 4273pi.org and follow us on Twitter @4273pi.

Bioinformatics: The Power of Computers in Biology - Higher/Advanced Higher level Biology workshop

This is a tutorial video for our Bioinformatics: The Power of Computers in Biology workshop designed for 16-18 year old biology students.

In this 10-minute video, we run through our Bioinformatics: The Power of Computers in Biology workshop.

Welcome to Bioinformatics for Biologists. I’m Stevie Bain, a researcher from the University of Edinburgh. In this video, we are going to run through some of the key activities for Tasks A and B in our workshop, Bioinformatics: The Power of Computers in Biology. It will be useful to have the accompanying handouts available as we progress through this short video. These can be found at 4273pi.org/teacher-resources. To complete these tasks, we will use the NCBI database – specifically the BLAST search tool.

This workshop provides an opportunity to gain practical experience in bioinformatics and highlights the link between DNA sequencing and computation. Here, we use a bioinformatics tool – the BLAST search tool - to explore mutations, evolution and nutrition.

Let’s begin with Task A – the identification of ‘mystery’ sequence R using the NCBI BLAST tool. First, we must access the BLAST search tool on the NCBI website – you can find the address for this website in the accompanying handout. Once here we can see that there are a number of different BLAST search tools. The BLAST program compares nucleotide or protein sequences to sequence databases. The program we need for this task is BLASTx. This will search for matches to our nucleotide sequence in the NCBI protein database.

We click on BLASTx and this takes us to our BLAST search form. It is here that we need to paste in Sequence R. But first, let’s retrieve sequence R.

Sequence R can be retrieved on the 4273pi website – 4273pi.org – under the heading schools. Look for Bioinformatics: The Power of Computers in Biology and click on the link. We highlight and copy the entire sequence including the defline at the top. We then go back to the BLAST search page and paste sequence R into the large box at the top where it says: “Enter Query Sequence”. We leave all other settings as default and click BLAST.

This BLAST search should only take a few moments. The important thing is not to refresh the page. You may want to pause the video here and run your own BLAST search for Task A.

When the BLAST search is complete, we see a results page that looks like this. In the top section of the page we have some information about the query sequence – in this case, sequence R – and the search that we’ve just conducted. For the purposes of this exercise, we are most interested in the table lower down the page titled ‘Sequences producing significant alignments’. This table contains the proteins in the database that best match sequence R.

This table is ordered so that our best result is in the top row. The order is determined by the E-value, a statistic that describes the number of hits one can expect to see by chance when searching a database of a particular size. Therefore, the lower this number the more reliable the match. Zero is the most reliable a match can be. Under the description column we find the name of the protein and in square brackets the species that protein comes from.

For this activity, we want to find out more about our best BLAST result. So, we can click on the name in the description column and it will give us more information about this match. At the top, we can see the name of the protein, L-gulonolactone oxidase, and the species it comes from, Mus musculus. If we want to find out more, we can click on sequence ID.

On this page, we get some detailed information about this protein including the accession number which is how the NCBI database is catalogued. This page also has information about the source of the protein – including the common name of the species. If we scroll down to ‘source’ we see Mus musculus also known as the house mouse. There is also a list of scientific literature related to this protein. By looking at the titles here we can begin to understand the biological role of this protein. However, we also recommend doing a web search of the protein name for a more concise description of its function.

Now let’s go back to the previous page and take a look at the alignment. An alignment is basically a way of arranging the query sequence – in this case sequence R – with a sequence in the database – called the ‘subject’ - to identify regions of similarity that may be a consequence of structural, functional and evolutionary relationships. Everywhere we see ‘Sbjct’ that’s the sequence from the database. Everywhere we see ‘Query’ that’s Sequence R. In between the query and the subject, we have the consensus sequence.

We will take a look at sequence alignments in much more detail in Task B. We now know that Sequence R is the mouse GULO gene, which codes for the protein L-gulonolactone oxidase - an enzyme involved in vitamin C production. We now want to find out if humans also have a functional copy of this gene or a non-functional pseudogene.

To do this we go back to the BLAST homepage and look for the BLAST genomes heading. Here we see a text box with a few species’ names underneath. We click on human and this takes us to a BLAST search tool similar to the one we used previously, however, this time, instead of BLAST-searching the whole database, this tool will specifically search for sequence R in the human genome database.

Again, we paste sequence R into the box at the top where it says enter query sequence and this time, we optimize for somewhat similar sequences in the program selection. This is because we are now doing a nucleotide BLAST – BLAST searching for a nucleotide sequence in a nucleotide database and we need to run a more sensitive search. We then click BLAST and wait for our results. You may want to pause the video here and run this search.

Once again, we patiently wait a few minutes for the results page to appear. When it does, we can see that the top of the page tells us some more information about the search we have just conducted. If we scroll down to the table of ‘Sequences producing significant alignments’, we will see the best matches for sequence R in the human genome. As with the previous search, the best match is in the first row.

In the Description column, we can see that our best matching sequence is on Homo sapiens chromosome 8. If we look at the e-value, we can see that this is a very low number, 6e-45. This is the same as 6 × 10^-45. This is a reliable match.

Let’s click on the name in the description column to view the alignment of sequence R with the matching sequence in the Homo sapiens database. As mentioned previously, an alignment is basically a way of arranging the query sequence with a sequence in the database to identify regions of similarity. These alignments allow us to spot mismatches and gaps between the two sequences that correspond to mutations.

The accompanying handouts explain in detail how BLAST annotates mutations in alignments. If we look along this first alignment here, we can see that where the base in the query matches the base in the subject, there is a little vertical line between the two. Where they do not match, there is no line – this is an example of a substitution mutation.

If we look at this point here, we see something a little different. In the subject sequence, instead of a base, we see a little dash or a hyphen. This represents an insertion or a deletion mutation. Here it is an example of a one-base frameshift mutation and therefore we can assume that humans have a ‘pseudogene’ - a segment of DNA that resembles a functional gene but has mutations that render it not functional.
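The way matches, mismatches and dashes appear in an alignment can be reproduced with a short Python sketch. This is a simplified imitation of the display, not BLAST itself. The function name describe_alignment and the two short sequences are hypothetical, and the sketch assumes both rows have already been aligned to the same length.

```python
def describe_alignment(query, subject):
    """Return the line of vertical bars plus counts of each column type."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must be the same length")
    midline = []
    counts = {"match": 0, "substitution": 0, "gap": 0}
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            midline.append(" ")  # a dash marks an insertion or deletion
            counts["gap"] += 1
        elif q == s:
            midline.append("|")  # a vertical line marks identical bases
            counts["match"] += 1
        else:
            midline.append(" ")  # no line marks a substitution
            counts["substitution"] += 1
    return "".join(midline), counts


# Hypothetical aligned sequences for illustration.
query_row = "ACGTTGCA"
subject_row = "ACCT-GCA"
midline, counts = describe_alignment(query_row, subject_row)
print("Query ", query_row)
print("      ", midline)
print("Sbjct ", subject_row)
print(counts)
```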

But how do we know if this mutation represents an insertion in the mouse sequence or a deletion in the human sequence? Well, usually from this output alone, we would not be able to say. However, in this case, we do have prior knowledge about sequence R. We know that this codes for a functional gene in the house mouse. Therefore, this must be a deletion in the human sequence.

If we take a look at this part of the alignment, we have an insertion/deletion with a length of three bases. This is not an example of a frameshift mutation as three bases make up a codon. Removing three bases does not shift the reading frame of the sequence in the way that removing 1 or 2 bases does.
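The difference between removing one base and removing three can be seen by translating a short sequence. The Python sketch below uses the standard genetic code. The function name translate and the 18-base example sequence are hypothetical and are not taken from sequence R.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA from its first base, ignoring any incomplete final codon."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE[dna[start:start + 3]])
    return "".join(protein)


# Hypothetical sequence: ATG GCT GAA AAG CTT TGA
original = "ATGGCTGAAAAGCTTTGA"
one_base_deleted = original[:3] + original[4:]     # remove 1 base
three_bases_deleted = original[:3] + original[6:]  # remove one whole codon

print("Original:        ", translate(original))
print("1-base deletion: ", translate(one_base_deleted))
print("3-base deletion: ", translate(three_bases_deleted))
```

After the one-base deletion every later codon is read differently and the stop codon is lost, whereas the three-base deletion removes a single amino acid and leaves the rest of the reading frame intact.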

Now that we have completed Task A and Task B, this opens up the opportunity for discussion about the biological role of vitamin C, how humans can get vitamin C and what happens if they do not get enough.

This article was published on 2024-09-02