2023-2024 Project
Guidelines
Schedule
The project should be done in groups of 4 people. The project’s deadline is December 12, 2023, from 13:15 to 14:45 in room TD5.127. Each group will be required to make an oral presentation.
You will have the opportunity to ask questions about the project on November 21, 2023, from 13:15 to 14:45 in room TD36.209. Groups should be formed on this date, and various tasks should be assigned to team members.
Structure
The primary goals of this project are as follows:
- A GitHub repository containing all the code and a description of your project.
- Organize your project structure as illustrated below, with pygen serving as the project’s name:
project_root_dir/
├── pygen/
│   ├── __init__.py
│   ├── words/
│   │   └── __init__.py
│   └── convert/
│       └── __init__.py
├── Readme.md
└── .gitignore
- The Readme.md file should include a comprehensive project description and provide a minimal working example that demonstrates the functionality of the code.
- Create slides for a brief presentation lasting 15 minutes, with an additional 5 minutes allocated for questions. The slide file should be included in the project repository.
These objectives will help ensure the successful completion and presentation of your project.
Grading
The project repository must show a balanced contribution from all group members. Grades may vary within a group to reflect an unbalanced workload.
| General | Details | Points |
|---|---|---|
| Git | Git repo (branch, commit, …) | 3 |
| Code | Structure of the code | 1 |
| | Description / Documentation | 2 |
| Work description | Clarity / Details | 2 |
| | Answers to the project goals | 8 |
| Oral | Slides | 2 |
| | Oral presentation | 2 |
| Total | | 20 |
The Git component will assess your group’s capacity to collaboratively merge individual contributions. Individuals who have not made any contributions in Git will incur a penalty.
English Words in the Human Proteome
The objective of this first project is to discover if English words can be found in the sequences of the human proteome, i.e., in the sequences of all human proteins.
Amino Acid Composition
In the first step, create 5 English words using the one-letter codes of the 20 amino acids.
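One way to check candidate words is to test whether every letter belongs to the set of one-letter amino acid codes. The sketch below is only an illustration: the name is_amino_acid_word() and the candidate words are hypothetical and do not come from the project statement.

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_amino_acid_word(word):
    """Return True if every letter of `word` is an amino acid one-letter code."""
    return len(word) > 0 and set(word.upper()) <= AMINO_ACIDS


# Hypothetical candidate words, chosen only for illustration.
for candidate in ["WATER", "SILK", "LIFE", "CELL", "GENE", "BOX", "JUMP"]:
    print(candidate, is_amino_acid_word(candidate))
```

The letters B, J, O, U, X and Z are not among the 20 codes, so any word containing one of them is rejected.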
Words
Download the file english-common-words.txt. This file contains the 3000 most common English words, with one word per line.
- Create a script words_in_proteome.py and write the function read_words() to read the words from the file provided as an argument to the script and return a list containing the words converted to uppercase and composed of 3 characters or more.
- In the main program, display the number of selected words.
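A minimal sketch of read_words() could look like the following. It assumes one word per line, as described above, and strips surrounding whitespace before measuring the length. This is one possible reading of the requirements, not a reference solution.

```python
def read_words(filename):
    """Return the words of `filename` in uppercase, keeping those of 3+ characters."""
    words = []
    with open(filename, "r") as words_file:
        for line in words_file:
            word = line.strip().upper()
            if len(word) >= 3:
                words.append(word)
    return words


# In the main program, the file name would come from the command line, e.g.:
#     words = read_words(sys.argv[1])
#     print(len(words))
```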
Proteins
Download the file human-proteome.fasta. Note that this file is quite large and comes from the UniProt database from this page.
Here are the first lines of this file (the [...] indicates a cut we made):
>sp|O95139|NDUB6_HUMAN NADH dehydrogenase [ubiquinone] 1 beta [...]
MTGYTPDEKLRLQQLRELRRRWLKDQELSPREPVLPPQKMGPMEKFWNKFLENKSPWRKM
VHGVYKKSIFVFTHVLVPVWIIHYYMKYHVSEKPYGIVEKKSRIFPGDTILETGEVIPPM
KEFPDQHH
>sp|O75438|NDUB1_HUMAN NADH dehydrogenase [ubiquinone] 1 beta [...]
MVNLLQIVRDHWVHVLVPMGFVIGCYLDRKSDERLTAFRNKSMLFKRELQPSEEVTWK
>sp|Q8N4C6|NIN_HUMAN Ninein OS=Homo sapiens OX=9606 GN=NIN PE=1 SV=4
MDEVEQDQHEARLKELFDSFDTTGTGSLGQEELTDLCHMLSLEEVAPVLQQTLLQDNLLG
RVHFDQFKEALILILSRTLSNEEHFQEPDCSLEAQPKYVRGGKRYGRRSLPEFQESVEEF
PEVTVIEPLDEEARPSHIPAGDCSEHWKTQRSEEYEAEGQLRFWNPDDLNASQSGSSPPQ
- In the words_in_proteome.py script, write the function read_sequences() that will read the proteome from the file provided as the second argument of the script. This function will return a dictionary with protein identifiers as keys (e.g., O95139, O75438, Q8N4C6) and their associated sequences as values.
- In the main program, display the number of sequences read. For testing purposes, also display the sequence associated with the protein O95139.
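The parsing can be sketched as follows. The sketch assumes that each header line starts with ">" (the standard FASTA convention) and that the identifier is the second field when the header is split on "|", as in the excerpt above. Files with a different header layout would need another rule.

```python
def read_sequences(filename):
    """Return a dict mapping protein identifiers to their sequences."""
    sequences = {}
    protein_id = None
    with open(filename, "r") as fasta_file:
        for line in fasta_file:
            line = line.strip()
            if line.startswith(">"):
                # Header such as ">sp|O95139|NDUB6_HUMAN ...": the identifier
                # is the second field when splitting on "|".
                protein_id = line.split("|")[1]
                sequences[protein_id] = ""
            elif protein_id is not None:
                # Sequence lines are concatenated under the current identifier.
                sequences[protein_id] += line
    return sequences


# In the main program, for example:
#     sequences = read_sequences(sys.argv[2])
#     print(len(sequences))
#     print(sequences["O95139"])
```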
Searching for Words
- Write the function search_words_in_proteome() that takes as arguments the list of words and the dictionary containing protein sequences. This function will count the number of sequences in which a word is present. The function will return a dictionary with words as keys and the number of sequences containing these words as values. The function will also display the following message for words found in the proteome:
ACCESS found in 1 sequences
ACID found in 38 sequences
ACT found in 805 sequences
This step may take a few minutes. Please be patient.
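A straightforward sketch uses two nested loops and the in operator. One assumption here: only words found at least once are stored in the returned dictionary, since the statement does not say whether words with zero hits should be kept. The toy data is invented for illustration.

```python
def search_words_in_proteome(words, sequences):
    """Count, for each word, the number of sequences that contain it."""
    words_found = {}
    for word in words:
        found = 0
        for sequence in sequences.values():
            if word in sequence:
                found += 1
        if found > 0:
            words_found[word] = found
            print(f"{word} found in {found} sequences")
    return words_found


# Toy data, invented for illustration only.
toy_sequences = {"P1": "MACTKACID", "P2": "ACTACT"}
toy_result = search_words_in_proteome(["ACT", "ACID", "DOG"], toy_sequences)
```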
The Most Frequent Word
- Finally, write the function find_most_frequent_word() that takes the dictionary returned by the previous function search_words_in_proteome() as an argument. This function will display the word found in the most sequences and the number of sequences in which it was found in the following format:
=> xxx found in yyy sequences
What is this word?
What percentage of the proteome sequences contain this word?
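The search for the maximum can be sketched with the built-in max() and a key function. Returning the word and its count is an addition of this sketch (the statement only asks for a display), made so that the percentage can be computed afterwards. All names and numbers in the toy data are invented.

```python
def find_most_frequent_word(words_found):
    """Display and return the word present in the largest number of sequences."""
    # max() raises ValueError on an empty dictionary; this sketch assumes
    # that at least one word was found.
    best_word = max(words_found, key=words_found.get)
    best_count = words_found[best_word]
    print(f"=> {best_word} found in {best_count} sequences")
    return best_word, best_count


# Toy counts, invented for illustration only (not real results).
toy_counts = {"WORDA": 12, "WORDB": 30, "WORDC": 7}
word, count = find_most_frequent_word(toy_counts)
total_sequences = 40  # hypothetical; in practice this would be len(sequences)
percentage = 100 * count / total_sequences
print(f"{percentage:.1f}% of the sequences contain {word}")
```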
Being More Comprehensive
Up to this point, we have determined, for each word, the number of sequences in which it appears. We could go further and also calculate how many times each word appears in the sequences.
To do this, modify the search_words_in_proteome() function to count the number of occurrences of a word in the sequences. The .count() method will be useful.
Determine which word is the most frequent in the human proteome.
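A possible variant is sketched below under a separate, hypothetical name (count_word_occurrences()) to keep it distinct from the previous version. Note that str.count() counts non-overlapping occurrences, so "AAAA".count("AA") is 2 and not 3.

```python
def count_word_occurrences(words, sequences):
    """Count, for each word, its total number of occurrences in all sequences."""
    occurrences = {}
    for word in words:
        total = 0
        for sequence in sequences.values():
            # str.count() returns the number of non-overlapping occurrences.
            total += sequence.count(word)
        if total > 0:
            occurrences[word] = total
    return occurrences


# Toy data, invented for illustration only.
toy_sequences = {"P1": "ACTACT", "P2": "MACTK"}
print(count_word_occurrences(["ACT", "DOG"], toy_sequences))
```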
GenBank to FASTA Converter
This project involves creating a file converter from GenBank format to FASTA format. The web page Some Data Formats Encountered in Biology provides details about these two file formats.
The dataset we will work with is the GenBank file of the yeast Saccharomyces cerevisiae’s chromosome I. The original file may be found at this url.
Reading the File
- Create a script genbank2fasta.py and write the function read_file() that takes the file’s name as an argument and returns the file’s content as a list of lines, where each line is a string.
- Test this function with the GenBank file NC_001133.gbk and display the number of lines read.
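A minimal sketch of read_file() is given below. It keeps the end-of-line characters in each string, which is one possible choice; stripping them would work equally well as long as the later functions agree.

```python
def read_file(filename):
    """Return the content of `filename` as a list of lines (strings)."""
    with open(filename, "r") as input_file:
        return input_file.readlines()


# For the test requested above, one could write:
#     lines = read_file("NC_001133.gbk")
#     print(len(lines))
```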
Extracting the Organism Name
In the same script, add the function extract_organism() that takes the file’s content (as a list of lines, as returned by the read_file() function) and returns the name of the organism. To retrieve the correct line, you can check if the line’s first characters contain the keyword ORGANISM.
- Test this function with the GenBank file NC_001133.gbk and display the organism’s name.
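The following sketch looks for the keyword in the first 12 characters of each line. That width is an assumption based on the usual GenBank header layout, and the toy lines only imitate that layout. Returning an empty string when no ORGANISM line exists is also a choice of this sketch.

```python
def extract_organism(lines):
    """Return the organism name found on the ORGANISM line, or "" if absent."""
    for line in lines:
        # The keyword is searched only among the first characters of the line.
        if "ORGANISM" in line[:12]:
            # Everything after the keyword is taken as the organism name.
            return line.split("ORGANISM", 1)[1].strip()
    return ""


# Toy excerpt imitating the layout of a GenBank header.
toy_lines = [
    "SOURCE      Saccharomyces cerevisiae\n",
    "  ORGANISM  Saccharomyces cerevisiae\n",
]
print(extract_organism(toy_lines))
```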
Finding Genes
In the GenBank file, sense genes are indicated like this:
     gene            58..272
or
     gene            <2480..>2707
and antisense genes (or complementary) like this:
     gene            complement(55979..56935)
or
     gene            complement(<13363..>13743)
The numeric values separated by .. indicate the gene’s position in the genome (first base number, last base number).
Locate these different genes in the NC_001133.gbk file. To retrieve these gene lines, check if the line begins with "     gene            " (i.e., 5 spaces, followed by the word “gene”, followed by 12 spaces). To determine if it’s a sense or antisense gene, check for the presence of the word complement in the line read.
- Then, if you wish to retrieve the start and end positions of the gene, we recommend using the .replace() method to keep only the numbers and the .. separator. For example,
     gene            <2480..>2707
will be changed to
2480..2707
Finally, using the .split() method, you can easily retrieve the start and end values of the gene and convert them to integers with int().
- In the same genbank2fasta.py script, add the function find_genes() that takes the file’s content (as a list of lines) as an argument and returns a list of genes.
Each gene will itself be a list containing the first base number, the last base number, and a string “sense” for a sense gene and “antisense” for an antisense gene.
- Test this function with the GenBank file NC_001133.gbk and display the number of genes found, as well as the number of sense and antisense genes.
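The steps described above (prefix test, strand detection, clean-up with .replace(), then .split()) can be sketched as follows. The sketch only handles the four location forms shown in this section. Other forms that GenBank files may contain, such as joined locations, are not handled and would make int() fail.

```python
# Gene lines start with 5 spaces, the word "gene", then 12 spaces.
GENE_PREFIX = " " * 5 + "gene" + " " * 12


def find_genes(lines):
    """Return a list of [start, end, strand] lists, one per gene line."""
    genes = []
    for line in lines:
        if not line.startswith(GENE_PREFIX):
            continue
        strand = "antisense" if "complement" in line else "sense"
        location = line[len(GENE_PREFIX):].strip()
        # Keep only the digits and the ".." separator.
        for unwanted in ("complement(", ")", "<", ">"):
            location = location.replace(unwanted, "")
        start, end = location.split("..")
        genes.append([int(start), int(end), strand])
    return genes


# The four example lines of this section, rebuilt with the expected spacing.
toy_lines = [
    GENE_PREFIX + "58..272\n",
    GENE_PREFIX + "<2480..>2707\n",
    GENE_PREFIX + "complement(55979..56935)\n",
    GENE_PREFIX + "complement(<13363..>13743)\n",
    "     CDS             58..272\n",
]
genes = find_genes(toy_lines)
n_sense = sum(1 for gene in genes if gene[2] == "sense")
print(len(genes), n_sense, len(genes) - n_sense)
```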
Extracting the Nucleotide Sequence of the Genome
The genome size is indicated on the first line of a GenBank file. Find the genome size stored in the NC_001133.gbk file.
In a GenBank file, the genome’s sequence is located between the lines
ORIGIN
and
//
Find the first and last lines of the genome sequence in the NC_001133.gbk file.
To retrieve the lines containing the sequence, we suggest using an algorithm with a flag is_dnaseq (which will be either True or False). Here is the proposed pseudo-code algorithm:
is_dnaseq <- False
Read each line of the gbk file
    If the line contains "//"
        is_dnaseq <- False
    If is_dnaseq is True
        Accumulate the sequence
    If the line contains "ORIGIN"
        is_dnaseq <- True
At the beginning, this flag will have the value False. Then, when it becomes True, you can read the lines containing the sequence, and when it becomes False again, you will stop.
- Once the sequence is retrieved, simply remove the numbers, carriage returns, and other spaces (Tip: calculate the length of the sequence and compare it to the one mentioned in the gbk file).
Still in the same genbank2fasta.py script, add the function extract_sequence() that takes the file’s content (as a list of lines) as an argument and returns the nucleotide sequence of the genome (as a string). The sequence should not contain spaces, numbers, or carriage returns.
- Test this function with the GenBank file NC_001133.gbk and display the number of bases in the extracted sequence. Verify that you have not made any errors by comparing the size of the extracted sequence with the one found in the GenBank file.
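The flag-based pseudo-code translates almost line for line into Python. In the sketch below, the clean-up keeps only alphabetic characters, which removes the position numbers, the spaces and the end-of-line characters in one pass. The toy lines are invented and only imitate the layout of a sequence block.

```python
def extract_sequence(lines):
    """Return the nucleotide sequence found between the ORIGIN and // lines."""
    is_dnaseq = False
    sequence = ""
    for line in lines:
        if "//" in line:
            is_dnaseq = False
        if is_dnaseq:
            # Keep letters only: drops position numbers, spaces and newlines.
            sequence += "".join(char for char in line if char.isalpha())
        if "ORIGIN" in line:
            is_dnaseq = True
    return sequence


# Toy lines imitating the layout of a GenBank sequence block.
toy_lines = [
    "FEATURES             Location/Qualifiers\n",
    "ORIGIN\n",
    "        1 atcgatcgat cgatcgatcg\n",
    "       21 atcg\n",
    "//\n",
]
toy_sequence = extract_sequence(toy_lines)
print(len(toy_sequence))
```

The order of the three tests matters: because the ORIGIN test comes last, the ORIGIN line itself is never accumulated.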
Creating the Reverse Complementary Sequence
Still in the same script, add the function construct_comp_inverse() that takes a DNA sequence as a string and returns the reverse complementary sequence (also as a string).
Remember that constructing the reverse complementary sequence of a DNA sequence involves:
- Taking the complementary sequence, which means replacing base a with base t, t with a, c with g, and g with c.
- Taking the reverse, which means that the first base of the complementary sequence becomes the last base and vice versa, and so on.
To make your work easier, only work with lowercase sequences.
- Test this function with the sequences atcg, AATTCCGG, and gattaca.
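The two steps can be sketched with a dictionary for the complement and a slice for the reversal. The sketch converts its input to lowercase first, and it assumes the sequence contains only a, t, c and g: any other character raises a KeyError.

```python
def construct_comp_inverse(sequence):
    """Return the reverse complementary sequence of `sequence`, in lowercase."""
    complement = {"a": "t", "t": "a", "c": "g", "g": "c"}
    sequence = sequence.lower()
    # Step 1: complementary sequence, base by base.
    complementary = "".join(complement[base] for base in sequence)
    # Step 2: reverse the complementary sequence.
    return complementary[::-1]


for seq in ["atcg", "AATTCCGG", "gattaca"]:
    print(seq, "->", construct_comp_inverse(seq))
```

For the three test sequences, this gives cgat, ccggaatt and tgtaatc.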
Writing a FASTA File
Still in the same script, add the function write_fasta() that takes a file name (as a string), a comment (as a string), and a sequence (as a string) as arguments, and writes a FASTA file. The sequence should be written in lines no longer than 80 characters.
As a reminder, a FASTA file follows the following format:
>comment
sequence on a line with a maximum of 80 characters
sequence continuation .........................
sequence continuation .........................
...
Test this function with the following:
- File name: test.fasta
- Comment: my comment
- Sequence:
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
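Splitting the sequence into lines of at most 80 characters can be done with a range() that advances by the line width. In this sketch the width parameter is an addition for convenience. The statement only asks for the three arguments file name, comment and sequence.

```python
def write_fasta(filename, comment, sequence, width=80):
    """Write `sequence` to `filename` in FASTA format, `width` characters per line."""
    with open(filename, "w") as fasta_file:
        fasta_file.write(f">{comment}\n")
        for start in range(0, len(sequence), width):
            fasta_file.write(sequence[start:start + width] + "\n")


# For the test requested above, one could write:
#     write_fasta("test.fasta", "my comment", sequence)
```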
Extracting Genes
Still in the same script, add the function extract_genes() that takes the list of genes, the complete nucleotide sequence (as a string), and the organism’s name (as a string) as arguments. For each gene:
- Extract the gene’s sequence from the complete sequence.
- If the gene is antisense, take the reverse complementary sequence (using the construct_comp_inverse() function).
- Save the gene in a FASTA format file (using the write_fasta() function).
- Display on the screen the gene number and the name of the created FASTA file.
The first line of the FASTA files will be in the following format:
>organism-name|gene-number|start|end|sense or antisense
The gene number will be consecutive from the first gene to the last. There will be no numbering difference between sense and antisense genes.
- Test this function with the GenBank file NC_001133.gbk.
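The four steps above can be sketched as follows. To keep the block self-contained, it includes compact stand-ins for construct_comp_inverse() and write_fasta(). Two things are assumptions of this sketch and not part of the statement: the optional out_dir argument and the gene_N.fasta naming scheme for the output files.

```python
import os


def construct_comp_inverse(sequence):
    """Compact stand-in for the function written in a previous step."""
    complement = {"a": "t", "t": "a", "c": "g", "g": "c"}
    return "".join(complement[base] for base in reversed(sequence.lower()))


def write_fasta(filename, comment, sequence):
    """Compact stand-in: FASTA file with lines of at most 80 characters."""
    with open(filename, "w") as fasta_file:
        fasta_file.write(f">{comment}\n")
        for start in range(0, len(sequence), 80):
            fasta_file.write(sequence[start:start + 80] + "\n")


def extract_genes(genes, sequence, organism, out_dir="."):
    """Write one FASTA file per gene; `out_dir` is an added, optional argument."""
    for number, (start, end, strand) in enumerate(genes, start=1):
        # GenBank positions are 1-based and inclusive, hence start - 1.
        gene_sequence = sequence[start - 1:end]
        if strand == "antisense":
            gene_sequence = construct_comp_inverse(gene_sequence)
        comment = f"{organism}|{number}|{start}|{end}|{strand}"
        # Hypothetical naming scheme for the output files.
        filename = os.path.join(out_dir, f"gene_{number}.fasta")
        write_fasta(filename, comment, gene_sequence)
        print(f"gene {number} -> {filename}")
```

Using enumerate() with start=1 gives the consecutive numbering shared by sense and antisense genes.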
Building the Final Script
To finish, modify the genbank2fasta.py script so that the GenBank file to be analyzed (in this example, NC_001133.gbk) is entered as an argument of the script.
You will display an error message if:
- The genbank2fasta.py script is used without an argument.
- The file provided as an argument does not exist.
To help you, you may read the documentation of the sys and os modules.
Test your finalized script.
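The two checks can be sketched with sys.argv and os.path.exists(). The function name check_arguments() and the wording of the messages are hypothetical. Passing a string to sys.exit() prints it on the error output and stops the script with a non-zero exit status.

```python
import os
import sys


def check_arguments(argv):
    """Return the GenBank file name from `argv`, or stop with an error message."""
    if len(argv) < 2:
        sys.exit("Usage: python genbank2fasta.py file.gbk")
    filename = argv[1]
    if not os.path.exists(filename):
        sys.exit(f"Error: the file {filename} does not exist")
    return filename


# At the start of the main program, one could then write:
#     filename = check_arguments(sys.argv)
#     lines = read_file(filename)
```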