I am looking for ways to improve the following code so that, instead of producing one total base count across all the records found, it counts the bases per record. I would also like it to distinguish between record types. In my case, if a U appears in any line of a record other than its header (the line starting with '>'), the record is RNA; if no U appears, it is DNA. I am having trouble getting the indentation level right for this last part.
This is my code:
import os


def registros(fichero):
    # Count the records in the file: one per header line starting with '>'.
    registros = 0
    try:
        with open(fichero, 'r') as f:
            if os.path.isfile(fichero) == True:
                print('Se encontró Fichero Fasta:',fichero)
            lineas = f.readlines()
            for l in lineas:
                if l.startswith('>'):
                    registros += 1
                else:
                    continue
    except FileNotFoundError:
        print('El fichero introducido no se ha encontrado, asegúrese de que se encuentra en ese directorio!')
    return registros


def numero_bases(fichero):
    # Count every base in the whole file (all records together).
    A = 0
    T = 0
    C = 0
    G = 0
    U = 0
    try:
        with open(fichero, 'r') as f:
            lineas = f.readlines()
            for l in lineas:
                # Header lines are skipped; only sequence lines are counted.
                if not l.startswith('>'):
                    for base in l:
                        if base == 'A':
                            A += 1
                        elif base == 'T':
                            T += 1
                        elif base == 'G':
                            G += 1
                        elif base == 'C':
                            C += 1
                        elif base == 'U':
                            U += 1
                        else:
                            continue
    except FileNotFoundError:
        print('Pruebe a Introducir un fichero existente')
    return 'con el siguiente contenido de bases:\nAdenina: {}\nTimina: {}\nCitosina: {}\nGuanina: {}\nUracilo: {}'.format(A, T, C, G, U)


# Keep asking for file names until the user types 'q'.
while True:
    fichero = input('Introduzca nombre del fichero FASTA(q para salir):\n')
    if fichero == 'q':
        break
    print('El fichero',fichero, 'contiene',registros(fichero),'registros', numero_bases(fichero))
For example, take a file called 2.fasta with the following content:
>YAL069W-1.334 Putative promoter sequence
CCACUG
CCACGG
>YAL068C-7235.2170 Putative promoter sequence
TACGC
TACGGG
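A minimal sketch of how a file like this could be split into records, so that anything counted afterwards is per record rather than for the whole file. The function name `leer_registros` is hypothetical and not part of the code above, and the sketch assumes that every record starts with a '>' header line.

```python
def leer_registros(lineas):
    """Yield (cabecera, secuencia) pairs from the lines of a FASTA file.

    Hypothetical helper: assumes each record starts with a '>' line.
    """
    cabecera = None
    trozos = []
    for linea in lineas:
        linea = linea.strip()
        if not linea:
            continue  # skip blank lines
        if linea.startswith('>'):
            # A new header closes the previous record, if there was one.
            if cabecera is not None:
                yield cabecera, ''.join(trozos)
            cabecera = linea
            trozos = []
        elif cabecera is not None:
            # Sequence line belonging to the current record.
            trozos.append(linea)
    # Emit the last record once the input is exhausted.
    if cabecera is not None:
        yield cabecera, ''.join(trozos)


# The lines of 2.fasta, as f.readlines() would return them.
ejemplo = [
    '>YAL069W-1.334 Putative promoter sequence\n',
    'CCACUG\n',
    'CCACGG\n',
    '>YAL068C-7235.2170 Putative promoter sequence\n',
    'TACGC\n',
    'TACGGG\n',
]
for cabecera, secuencia in leer_registros(ejemplo):
    print(cabecera, secuencia)
```

Joining the sequence lines of each record into one string means the counting and the type check can be done once per record, outside the loop that reads the lines.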
The input would look like this:
Introduzca nombre del fichero FASTA(q para salir):
2.fasta
The output should look something like this:
Se encontró Fichero Fasta: 2.fasta
El fichero 2.fasta contiene 2 registros con el siguiente contenido de bases:
>YAL069W-1.334 Putative promoter sequence:
Es un registro de tipo RNA:
Adenina: 2
Timina: 0
Citosina: 6
Guanina: 3
Uracilo: 1
>YAL068C-7235.2170 Putative promoter sequence:
Es un registro de tipo DNA:
Adenina: 2
Timina: 2
Citosina: 3
Guanina: 4
Uracilo: 0
RNA records contain U instead of T, and DNA records contain T instead of U.
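The rule above could be sketched as follows for the sequence of a single record, with its lines already joined into one string. The names `contar_bases` and `tipo_registro` are hypothetical, and treating every sequence without a U as DNA is an assumption taken from the description, not a general rule for FASTA files.

```python
from collections import Counter


def contar_bases(secuencia):
    """Count A, T, C, G and U in one record's sequence.

    Any other character (newlines, N, lowercase letters) is ignored,
    as in the original numero_bases.
    """
    cuenta = Counter(secuencia)
    # Counter returns 0 for bases that never appear.
    return {base: cuenta[base] for base in 'ATCGU'}


def tipo_registro(secuencia):
    """Assumption: any U makes the record RNA; otherwise it is treated as DNA."""
    return 'RNA' if 'U' in secuencia else 'DNA'


# Sequences of the two records in 2.fasta, with their lines joined.
for secuencia in ('CCACUGCCACGG', 'TACGCTACGGG'):
    print('Es un registro de tipo {}:'.format(tipo_registro(secuencia)))
    print(contar_bases(secuencia))
```

Because the type is decided from the whole sequence of a record rather than inside the per-character loop, the check sits at a single indentation level, which may avoid the nesting problem described at the start.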