2023-2024 Project

Guidelines

Schedule

The project should be done in groups of 4 people. The project’s deadline is December 12, 2023, from 13:15 to 14:45 in room TD5.127. Each group will be required to make an oral presentation.

You will have the opportunity to ask questions about the project on November 21, 2023, from 13:15 to 14:45 in room TD36.209. Groups should be formed on this date, and various tasks should be assigned to team members.

Structure

The primary goals of this project are as follows:

Create a GitHub repository containing all the code and a description of your project.
Organize your project structure as illustrated below, with pygen serving as the project’s name:

project_root_dir/
   ├── pygen/
   │     ├── __init__.py
   │     ├── words/
   │     │     └── __init__.py
   │     └── convert/
   │           └── __init__.py
   ├── Readme.md
   └── .gitignore

The Readme.md file should include a comprehensive project description and provide a minimal working example that demonstrates the functionality of the code.
Create slides for a brief presentation lasting 15 minutes, with an additional 5 minutes allocated for questions. The slide file should be included in the project repository.

These objectives will help ensure the successful completion and presentation of your project.

Grading

Important

The project repository must show a balanced contribution from all group members. Grades may be adjusted within a group to reflect an unbalanced workload.


General           Details                          Points
Git               Git repo (branch, commit, …)     3
Code              Structure of the code            1
                  Description / Documentation      2
Work description  Clarity / Details                2
                  Achievement of project goals     8
Oral              Slides                           2
                  Oral presentation                2
Total                                              20

The Git component will assess your group’s capacity to collaboratively merge individual contributions. Individuals who have not made any contributions in Git will incur a penalty.

English Words in the Human Proteome

The objective of this first project is to discover if English words can be found in the sequences of the human proteome, i.e., in the sequences of all human proteins.

Amino Acid Composition

In the first step, create 5 English words using only the one-letter codes of the 20 amino acids.
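A quick way to check a candidate word is to test each of its letters against the amino acid alphabet. The sketch below is only an illustration: the names AMINO_ACIDS and is_protein_word() are not part of the assignment, and the alphabet is assumed to be the 20 standard one-letter codes.

```python
# One-letter codes of the 20 standard amino acids.
# The letters B, J, O, U, X and Z are not among them.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_protein_word(word):
    """Return True if every letter of word is a standard amino acid code."""
    return all(letter in AMINO_ACIDS for letter in word.upper())


# Toy candidates: "dog" contains O and "juice" contains J and U.
for candidate in ["cat", "dog", "make", "juice"]:
    print(candidate, is_protein_word(candidate))
```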

Words

Download the file english-common-words.txt. This file contains the 3000 most common English words, with one word per line.
Create a script words_in_proteome.py and write the function read_words() to read the words from the file provided as an argument to the script and return a list containing the words converted to uppercase and composed of 3 characters or more.
In the main program, display the number of selected words.
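One possible shape for read_words() is sketched below. It assumes the file holds exactly one word per line, as described above, and it ignores empty lines because they are shorter than 3 characters.

```python
def read_words(filename):
    """Return the words of 3 characters or more, converted to uppercase."""
    words = []
    with open(filename, "r") as word_file:
        for line in word_file:
            # strip() removes the trailing line break and surrounding spaces.
            word = line.strip().upper()
            if len(word) >= 3:
                words.append(word)
    return words
```

In the main program, the file name would typically come from sys.argv[1], and len() applied to the returned list gives the number of selected words.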

Proteins

Download the file human-proteome.fasta. Note that this file is quite large and comes from the UniProt database from this page.
Here are the first lines of this file (the [...] indicates a cut we made):

    >sp|O95139|NDUB6_HUMAN NADH dehydrogenase [ubiquinone] 1 beta [...]
    MTGYTPDEKLRLQQLRELRRRWLKDQELSPREPVLPPQKMGPMEKFWNKFLENKSPWRKM
    VHGVYKKSIFVFTHVLVPVWIIHYYMKYHVSEKPYGIVEKKSRIFPGDTILETGEVIPPM
    KEFPDQHH
    >sp|O75438|NDUB1_HUMAN NADH dehydrogenase [ubiquinone] 1 beta [...]
    MVNLLQIVRDHWVHVLVPMGFVIGCYLDRKSDERLTAFRNKSMLFKRELQPSEEVTWK
    >sp|Q8N4C6|NIN_HUMAN Ninein OS=Homo sapiens OX=9606 GN=NIN PE=1 SV=4
    MDEVEQDQHEARLKELFDSFDTTGTGSLGQEELTDLCHMLSLEEVAPVLQQTLLQDNLLG
    RVHFDQFKEALILILSRTLSNEEHFQEPDCSLEAQPKYVRGGKRYGRRSLPEFQESVEEF
    PEVTVIEPLDEEARPSHIPAGDCSEHWKTQRSEEYEAEGQLRFWNPDDLNASQSGSSPPQ

In the words_in_proteome.py script, write the function read_sequences() that will read the proteome from the file provided as the second argument of the script. This function will return a dictionary with protein identifiers as keys (e.g., O95139, O75438, Q8N4C6) and their associated sequences as values.
In the main program, display the number of sequences read. For testing purposes, also display the sequence associated with the protein O95139.
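A minimal sketch of read_sequences() follows. It assumes that header lines start with ">" (the standard FASTA convention) and that the identifier is the second field when the header is split on "|". Sequence lines are concatenated until the next header.

```python
def read_sequences(filename):
    """Return a dictionary {protein identifier: sequence} from a FASTA file."""
    sequences = {}
    prot_id = None
    with open(filename, "r") as fasta_file:
        for line in fasta_file:
            line = line.strip()
            if line.startswith(">"):
                # Header such as ">sp|O95139|NDUB6_HUMAN ...":
                # the identifier is the second "|"-separated field.
                prot_id = line.split("|")[1]
                sequences[prot_id] = ""
            elif prot_id is not None:
                # Sequence lines are joined without line breaks.
                sequences[prot_id] += line
    return sequences
```

With the returned dictionary, len() gives the number of sequences, and the key "O95139" gives access to the test sequence.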

Searching for Words

Write the function search_words_in_proteome() that takes as arguments the list of words and the dictionary containing protein sequences. For each word, this function will count the number of sequences in which the word is present. It will return a dictionary with words as keys and the number of sequences containing each word as values. For each word found in the proteome, the function will also display a message in the following format:

ACCESS found in 1 sequences
ACID found in 38 sequences
ACT found in 805 sequences

This step may take a few minutes. Please be patient.
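The search can be sketched with two nested loops and the in operator. In this version, words that are absent from every sequence are kept in the dictionary with a count of 0 and are simply not displayed; that is one reading of the statement, not the only one. The sequences used in the demonstration are toy data.

```python
def search_words_in_proteome(words, sequences):
    """Count, for each word, the number of sequences that contain it."""
    words_found = {}
    for word in words:
        nb_sequences = 0
        for sequence in sequences.values():
            if word in sequence:
                nb_sequences += 1
        words_found[word] = nb_sequences
        # Only words present in at least one sequence are displayed.
        if nb_sequences > 0:
            print(f"{word} found in {nb_sequences} sequences")
    return words_found


# Toy data, not taken from the real proteome.
toy_sequences = {"P1": "MACTGACID", "P2": "KKACTLL", "P3": "MWWW"}
counts = search_words_in_proteome(["ACT", "ACID", "DOG"], toy_sequences)
```

The running time grows with the number of words multiplied by the number of sequences, which is why the full proteome takes a while.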

The Most Frequent Word

Finally, write the function find_most_frequent_word() that takes the dictionary returned by the previous function search_words_in_proteome() as an argument. This function will display the word found in the most sequences and the number of sequences in which it was found in the following format:

=> xxx found in yyy sequences

What is this word?
What percentage of the proteome sequences contain this word?
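A possible sketch, using max() with the dictionary values as the sort key. Returning the word in addition to displaying it, and the helper percentage_of_proteome(), are additions made here for convenience; neither is required by the statement. The counts are toy data.

```python
def find_most_frequent_word(words_found):
    """Display and return the word found in the largest number of sequences."""
    # max() raises ValueError on an empty dictionary; on ties it keeps
    # the first word encountered.
    best_word = max(words_found, key=words_found.get)
    print(f"=> {best_word} found in {words_found[best_word]} sequences")
    return best_word


def percentage_of_proteome(nb_hits, nb_sequences):
    """Return the percentage of sequences that contain the word."""
    return 100 * nb_hits / nb_sequences


# Toy counts for a toy proteome of 3 sequences.
toy_counts = {"ACT": 2, "ACID": 1, "DOG": 0}
best = find_most_frequent_word(toy_counts)
print(f"{percentage_of_proteome(toy_counts[best], 3):.1f} % of the sequences")
```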

Being More Comprehensive

Up to this point, we have determined, for each word, the number of sequences in which it appears. We could go further and also calculate how many times each word appears in the sequences.

To do this, modify the search_words_in_proteome() function to count the number of occurrences of a word in the sequences. The .count() method will be useful.
Determine which word is the most frequent in the human proteome.
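The counting variant can be sketched as follows. It is written here as a separate function with the hypothetical name count_word_occurrences() so that it can be compared with the previous version. Note that str.count() counts non-overlapping occurrences only: "AAAA".count("AA") returns 2, not 3.

```python
def count_word_occurrences(words, sequences):
    """Count, for each word, its total number of occurrences in all sequences."""
    occurrences = {}
    for word in words:
        # str.count() returns the number of non-overlapping occurrences.
        occurrences[word] = sum(seq.count(word) for seq in sequences.values())
    return occurrences


# Toy data: ACT appears twice in P1 and once in P2.
toy_sequences = {"P1": "ACTACTK", "P2": "MACT", "P3": "MWWW"}
toy_occurrences = count_word_occurrences(["ACT", "DOG"], toy_sequences)
print(toy_occurrences)
```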

GenBank to FASTA Converter

This project involves creating a file converter from GenBank format to FASTA format. The web page Some Data Formats Encountered in Biology provides details about these two file formats.

The dataset we will work with is the GenBank file of the yeast Saccharomyces cerevisiae’s chromosome I. The original file may be found at this url.

Reading the File

Create a script genbank2fasta.py and write the function read_file() that takes the file’s name as an argument and returns the file’s content as a list of lines, where each line is a string.
Test this function with the GenBank file NC_001133.gbk and display the number of lines read.
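A minimal sketch of read_file(), relying on readlines(). Each returned string keeps its trailing line break, which later steps have to take into account.

```python
def read_file(filename):
    """Return the content of a file as a list of lines (strings)."""
    with open(filename, "r") as input_file:
        return input_file.readlines()
```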

Extracting the Organism Name

In the same script, add the function extract_organism() that takes the file’s content (as a list of lines, as returned by the read_file() function) and returns the name of the organism. To retrieve the correct line, you can check whether the first characters of the line contain the keyword ORGANISM.

Test this function with the GenBank file NC_001133.gbk and display the organism’s name.
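One way to write extract_organism() is sketched below. It assumes that the keyword is preceded only by spaces on its line and that the organism name is the rest of that line. Returning an empty string when no such line exists is a choice made for this sketch. The lines in the demonstration are toy data.

```python
def extract_organism(lines):
    """Return the organism name found on the ORGANISM line, or "" if absent."""
    for line in lines:
        # Leading spaces are removed before testing for the keyword.
        if line.strip().startswith("ORGANISM"):
            # Keep what follows the keyword, without surrounding spaces.
            return line.split("ORGANISM", 1)[1].strip()
    return ""


toy_lines = [
    "SOURCE      Saccharomyces cerevisiae\n",
    "  ORGANISM  Saccharomyces cerevisiae\n",
]
print(extract_organism(toy_lines))
```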

Finding Genes

In the GenBank file, sense genes are indicated like this:

     gene            58..272

     gene            <2480..>2707

and antisense genes (or complementary) like this:

     gene            complement(55979..56935)

     gene            complement(<13363..>13743)

The numeric values separated by .. indicate the gene’s position in the genome (first base number, last base number).

Important

The < symbol indicates a partial gene at the 5’ end, meaning that the corresponding START codon is incomplete. Similarly, the > symbol denotes a partial gene at the 3’ end, indicating that the corresponding STOP codon is incomplete. For more details, refer to the NCBI documentation on gene boundaries. Here, we propose to ignore these > and < symbols.

Locate these different genes in the NC_001133.gbk file. To retrieve these gene lines, check if the line begins with

     gene

(i.e., 5 spaces, followed by the word “gene”, followed by 12 spaces). To determine whether it is a sense or antisense gene, check for the presence of the word complement in the line read.

Then, to retrieve the start and end positions of the gene, we recommend using the replace() method to keep only the digits and the .. separator. For example,

     gene            <2480..>2707

will be changed to

2480..2707

Finally, using the .split() method with .. as the separator, you can retrieve the start and end of the gene as two strings, which you then convert to integers with int().

In the same genbank2fasta.py script, add the function find_genes() that takes the file’s content (as a list of lines) as an argument and returns a list of genes.

Each gene will itself be a list containing the first base number, the last base number, and a string “sense” for a sense gene and “antisense” for an antisense gene.

Test this function with the GenBank file NC_001133.gbk and display the number of genes found, as well as the number of sense and antisense genes.
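The steps described above (prefix test, complement test, replace(), split()) can be combined as in the sketch below. It assumes every gene location has the simple form start..end, possibly wrapped in complement() and decorated with < or >; other location syntaxes would need extra handling. The constant GENE_PREFIX is a name introduced for this sketch, and the lines are toy data.

```python
# 5 spaces, the word "gene", then 12 spaces.
GENE_PREFIX = " " * 5 + "gene" + " " * 12


def find_genes(lines):
    """Return a list of genes as [start, end, "sense" or "antisense"]."""
    genes = []
    for line in lines:
        if line.startswith(GENE_PREFIX):
            strand = "antisense" if "complement" in line else "sense"
            location = line[len(GENE_PREFIX):].strip()
            # Keep only the digits and the ".." separator.
            for text in ["complement(", ")", "<", ">"]:
                location = location.replace(text, "")
            start, end = location.split("..")
            genes.append([int(start), int(end), strand])
    return genes


toy_lines = [
    "     gene            58..272\n",
    "                     /locus_tag=\"example\"\n",
    "     gene            complement(<13363..>13743)\n",
]
toy_genes = find_genes(toy_lines)
print(toy_genes)
print(sum(1 for gene in toy_genes if gene[2] == "sense"), "sense gene(s)")
```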

Extracting the Nucleotide Sequence of the Genome

The genome size is indicated on the first line of a GenBank file. Find the genome size stored in the NC_001133.gbk file.

In a GenBank file, the genome’s sequence is located between the lines

ORIGIN

and

//

Find the first and last lines of the genome sequence in the NC_001133.gbk file.
To retrieve the lines containing the sequence, we suggest using an algorithm with a flag is_dnaseq (which will be either True or False). Here is the proposed pseudo-code algorithm:

is_dnaseq <- False
Read each line of the gbk file
    If the line contains "//"
        is_dnaseq <- False
    If is_dnaseq is True
        Accumulate the sequence
    If the line contains "ORIGIN"
        is_dnaseq <- True

At the beginning, this flag will have the value False. Then, when it becomes True, you can read the lines containing the sequence, and when it becomes False again, you will stop.

Once the sequence is retrieved, simply remove the numbers, line breaks, and spaces (Tip: calculate the length of the sequence and compare it to the one mentioned in the gbk file).

Still in the same genbank2fasta.py script, add the function extract_sequence() that takes the file’s content (as a list of lines) as an argument and returns the nucleotide sequence of the genome (as a string). The sequence should not contain spaces, numbers, or carriage returns.

Test this function with the GenBank file NC_001133.gbk and display the number of bases in the extracted sequence. Verify that you have not made any errors by comparing the size of the extracted sequence with the one found in the GenBank file.
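The flag-based pseudo-code translates almost line for line into Python. The sketch below tests whether a line starts with ORIGIN or //, which is slightly stricter than the "contains" test of the pseudo-code, and it keeps only alphabetic characters, which removes the base counters, spaces, and line breaks in one step. The lines in the demonstration are toy data.

```python
def extract_sequence(lines):
    """Return the nucleotide sequence located between ORIGIN and //."""
    is_dnaseq = False
    sequence = ""
    for line in lines:
        if line.startswith("//"):
            is_dnaseq = False
        if is_dnaseq:
            # Keep letters only: drops numbers, spaces and line breaks.
            sequence += "".join(char for char in line if char.isalpha())
        if line.startswith("ORIGIN"):
            is_dnaseq = True
    return sequence


toy_lines = [
    "FEATURES             Location/Qualifiers\n",
    "ORIGIN\n",
    "        1 ccacaccaca cccacacacc\n",
    "       21 cacaccacac\n",
    "//\n",
]
toy_sequence = extract_sequence(toy_lines)
print(len(toy_sequence), "bases")
```

The order of the three tests matters: because the ORIGIN test comes last, the ORIGIN line itself is never added to the sequence.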

Creating the Reverse Complementary Sequence

Still in the same script, add the function construct_comp_inverse() that takes a DNA sequence as a string and returns the reverse complementary sequence (also as a string).

Remember that constructing the reverse complementary sequence of a DNA sequence involves:

Taking the complementary sequence, which means replacing base a with base t, t with a, c with g, and g with c.
Taking the reverse, which means that the first base of the complementary sequence becomes the last base and vice versa, and so on.

To make your work easier, only work with lowercase sequences.

Test this function with the sequences atcg, AATTCCGG, and gattaca.
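A compact sketch using a dictionary for the complement and slicing for the reversal. The input is converted to lowercase first, in line with the advice above. Any character other than a, t, c, or g raises a KeyError in this version.

```python
def construct_comp_inverse(sequence):
    """Return the reverse complementary sequence of a DNA sequence."""
    complement = {"a": "t", "t": "a", "c": "g", "g": "c"}
    # Step 1: complementary sequence, base by base.
    comp_sequence = "".join(complement[base] for base in sequence.lower())
    # Step 2: reverse the complementary sequence.
    return comp_sequence[::-1]


for seq in ["atcg", "AATTCCGG", "gattaca"]:
    print(seq, "->", construct_comp_inverse(seq))
# atcg -> cgat
# AATTCCGG -> ccggaatt
# gattaca -> tgtaatc
```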

Writing a FASTA File

Still in the same script, add the function write_fasta() that takes a file name (as a string), a comment (as a string), and a sequence (as a string) as arguments, and writes a FASTA file. The sequence should be written in lines no longer than 80 characters.

As a reminder, a FASTA file follows the following format:

>comment
sequence on a line with a maximum of 80 characters
sequence continuation .........................
sequence continuation .........................
...

Test this function with the following:
- File name: test.fasta
- Comment: my comment
- Sequence: atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
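The line wrapping can be done by slicing the sequence in steps of 80 characters. In the sketch below, the width parameter is an addition made for convenience; the statement only requires lines of at most 80 characters.

```python
def write_fasta(filename, comment, sequence, width=80):
    """Write a sequence to a FASTA file, with lines of at most width characters."""
    with open(filename, "w") as fasta_file:
        fasta_file.write(f">{comment}\n")
        # Each slice holds at most width characters; the last one may be shorter.
        for start in range(0, len(sequence), width):
            fasta_file.write(sequence[start:start + width] + "\n")
```

Calling write_fasta("test.fasta", "my comment", sequence) with a sequence longer than 80 characters should produce a header line followed by at least two sequence lines.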

Extracting Genes

Still in the same script, add the function extract_genes() that takes the list of genes, the complete nucleotide sequence (as a string), and the organism’s name (as a string) as arguments. For each gene:

Extract the gene’s sequence in the complete sequence.
If the gene is antisense, take the reverse complementary sequence (using the construct_comp_inverse() function).
Save the gene in a FASTA format file (using the write_fasta() function).
Display on the screen the gene number and the name of the created FASTA file.

The first line of the FASTA files will be in the following format:

>organism-name|gene-number|start|end|sense or antisense

The gene number will be consecutive from the first gene to the last. There will be no numbering difference between sense and antisense genes.

Test this function with the GenBank file NC_001133.gbk.
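The sketch below chains the steps listed above. Several points are assumptions: GenBank positions are treated as 1-based and inclusive, so the slice is sequence[start - 1:end]; the file name pattern gene_<number>.fasta is hypothetical, since the statement does not impose one; and returning the list of file names is an addition. Short versions of the two helper functions are included so that the example runs on its own.

```python
def construct_comp_inverse(sequence):
    """Return the reverse complementary sequence (lowercase)."""
    complement = {"a": "t", "t": "a", "c": "g", "g": "c"}
    return "".join(complement[base] for base in sequence.lower())[::-1]


def write_fasta(filename, comment, sequence):
    """Write a FASTA file with sequence lines of at most 80 characters."""
    with open(filename, "w") as fasta_file:
        fasta_file.write(f">{comment}\n")
        for start in range(0, len(sequence), 80):
            fasta_file.write(sequence[start:start + 80] + "\n")


def extract_genes(genes, sequence, organism):
    """Write one FASTA file per gene and return the list of file names."""
    filenames = []
    # Genes are numbered consecutively from 1, whatever their strand.
    for number, (start, end, strand) in enumerate(genes, start=1):
        # Positions are assumed 1-based and inclusive.
        gene_sequence = sequence[start - 1:end]
        if strand == "antisense":
            gene_sequence = construct_comp_inverse(gene_sequence)
        comment = f"{organism}|{number}|{start}|{end}|{strand}"
        filename = f"gene_{number}.fasta"  # hypothetical naming scheme
        write_fasta(filename, comment, gene_sequence)
        print(f"gene {number} written to {filename}")
        filenames.append(filename)
    return filenames


# Toy genome and toy genes.
toy_files = extract_genes(
    [[1, 3, "sense"], [4, 8, "antisense"]], "aaaccgggtt", "toy organism"
)
```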

Building the Final Script

To finish, modify the genbank2fasta.py script so that the GenBank file to be analyzed (in this example, NC_001133.gbk) is entered as an argument of the script.

You will display an error message if:

The genbank2fasta.py script is used without an argument.
The file provided as an argument does not exist.

To help you, you may read the documentation of the sys and os modules.
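Both checks can be sketched with len(sys.argv) and os.path.isfile(). The helper name get_genbank_filename() and the wording of the messages are choices made for this sketch. Calling sys.exit() with a string prints the message on the error output and stops the script with a non-zero exit status.

```python
import os
import sys


def get_genbank_filename(arguments):
    """Return the file name given as first argument, or exit with a message."""
    # arguments[0] is the script name, so a file name needs at least 2 items.
    if len(arguments) < 2:
        sys.exit("Usage: python genbank2fasta.py <genbank-file>")
    filename = arguments[1]
    if not os.path.isfile(filename):
        sys.exit(f"Error: file {filename} does not exist.")
    return filename
```

In the main program, the function would be called with sys.argv, and the returned name passed to read_file().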

Test your finalized script.