Count base frequency in record file

1

I am looking for ways to improve the following piece of code in such a way that instead of making a total count of the bases of all the records found, it does so by registration. On the other hand I would also like to make it sensitive to one type of record or another. For example in my case if in a line other than the header ('>') of the record appears a U would be an RNA type record if there is no DNA type, but for this last I am having problems with the level of bleeding.

This is my code:

import os


def registros(fichero):
    registros = 0 
    try:
        with open(fichero, 'r') as f:
            if os.path.isfile(fichero) == True: 
                print('Se encontró Fichero Fasta:',fichero)
            lineas = f.readlines() 
            for l in lineas:  
                if l.startswith('>'): 
                    registros += 1 inicial
                else:
                    continue 
    except FileNotFoundError: 
        print('El fichero introducido no se ha encontrado, asegúrese de que se encuentra en ese directorio!')

    return registros

def numero_bases(fichero):
    A = 0
    T = 0
    C = 0
    G = 0
    U = 0
   try: 
       with open(fichero, 'r') as f: 
       lineas = f.readlines() 
       for l in lineas:
            if not l.startswith('>'):
                for base in l:
                    if base  == 'A':
                        A += 1
                    elif base == 'T':
                        T += 1
                    elif base == 'G':
                        G += 1
                    elif base == 'C':
                        C += 1
                    elif base == 'U':
                        U += 1

            else:
                continue

    except FileNotFoundError: 
        print('Pruebe a Introducir un fichero existente')

    return 'con el siguiente contenido de bases:\nAdenina: {}\nTimina: {}\nCitosina: {}\nGuanina: {}\nUracilo: {}'.format(A, T, C, G, U)


while True:
    fichero = input('Introduzca nombre del fichero FASTA(q para salir):\n')
    if fichero == 'q':
        break
    print('El fichero',fichero, 'contiene',registros(fichero),'registros', numero_bases(fichero))

An example for the file called 2.fasta that has this content would be like this:

>YAL069W-1.334 Putative promoter sequence
CCACUG
CCACGG

>YAL068C-7235.2170 Putative promoter sequence
TACGC
TACGGG

The entry of the data would be of the type:

Introduzca nombre del fichero FASTA(q para salir):
2.fasta

The output of the data should look something like this:

Se encontró Fichero Fasta: 2.fasta
El fichero 2.fasta contiene 2 registros con el siguiente contenido de bases:
>YAL069W-1.334 Putative promoter sequence:
Es un registro de tipo RNA:
Adenina: 2
Timina: 0
Citosina: 6
Guanina: 3
Uracilo: 1
>YAL068C-7235.2170 Putative promoter sequence:
Es un registro de tipo DNA:
Adenina: 2
Timina: 2
Citosina: 3
Guanina: 4
Uracilo: 0

Records of type RNA have U instead of T and vice versa with records of type DNA.

    
asked by Steve Jade 29.10.2017 в 20:04
source

1 answer

1

You can use collections.Counter to count the bases of each line. To know if it is RNA or DNA, it is enough that we check whether the "U" key is in the dictionary or not.

It is also important, especially if you are going to work with relatively large files, do not go through them more than once.

There are multiple ways to organize the code, one of them can be:

import collections
import os



def registros(fichero):
    registros = collections.OrderedDict()
    contador = 0
    for linea in fichero:
        if linea.startswith('>'):
            contador += 1
            registros[contador] = collections.Counter()
        else:
            registros[contador].update(linea)
    return registros

def parse_fasta(path):
    try:
        with open(path, 'r') as f:
            if os.path.isfile(path): 
                print('Se encontró Fichero Fasta:', fichero)
                regs = registros(f)
                print("El fichero {} contiene {} registros con el siguiente contenido de bases:".format(fichero, len(regs)))
                for reg, bases in regs.items():
                    print("  Registro {}:".format(reg))
                    if "U" in bases:
                        print("   Es un registro de tipo ARN:")
                        print("    Adenina:  {}\n    Uracilo:  {}\n    Citosina: {}\n    Guanina:  {}\n".format(bases["A"],
                                                                                                                bases["U"], 
                                                                                                                bases["C"], 
                                                                                                                bases["G"]))
                    else:
                        print("   Es un registro de tipo ADN:")
                        print("    Adenina:  {}\n    Timina:   {}\n    Citosina: {}\n    Guanina:  {}\n".format(bases["A"],
                                                                                                                bases["T"], 
                                                                                                                bases["C"], 
                                                                                                                bases["G"]))
            else:
                print("La ruta no se corresponde con un fichero.")

    except FileNotFoundError: 
        print('El fichero introducido no se ha encontrado, asegúrese de que se encuentra en ese directorio!')

while True:
    fichero = input('Introduzca nombre del fichero FASTA(q para salir):\n')
    if fichero == 'q':
        break
    parse_fasta(fichero)    

The output for your example is:

>>> Introduzca nombre del fichero FASTA(q para salir):
2.fasta
Se encontró Fichero Fasta: 2.fasta
El fichero 2.fasta contiene 2 registros con el siguiente contenido de bases:
  Registro 1:
   Es un registro de tipo ARN:
    Adenina:  2
    Uracilo:  1
    Citosina: 6
    Guanina:  3

  Registro 2:
   Es un registro de tipo ADN:
    Adenina:  2
    Timina:   2
    Citosina: 3
    Guanina:  4

If you want the registry to be named according to the header (as long as the headers within a file are unique) you can modify the function registros :

def registros(fichero):
    registros = collections.OrderedDict()
    for linea in fichero:
        if linea.startswith('>'):
            nombre = linea[1:].strip()
            registros[nombre] = collections.Counter()
        else:
            registros[nombre].update(linea)
    return registros

The function assumes that the first line of the file will always be a header (or an empty file), never a blank line or anything else. If it is possible that this is not the case, it should be taken into account.

    
answered by 29.10.2017 в 23:46