IUBio

FASTA format to Tab-Delimited: How??

K.James bss194 at thunder
Sun Dec 15 10:56:15 EST 1996


Jon Duvick (duvickj at phibred.com) wrote:
: Are there any programs out there that will parse FASTA files
: (specifically, text files containing multiple sequences in FASTA
: format) into common database formats such as tab-delimited?
: Thanks
: Jon Duvick
: duvickj at phibred.com

As I'm teaching myself Perl and needed some simple, real-life tasks
to code up, I put together a script to do this. I guessed that you
wanted the file

> HBING_DH Human bing dehydrogenase
CATCATGCGCGCTACGCATCGACGC
ACGATCGCAGTGGTGATGAGCGAGA
> MBING_AL Moose bing aldolase
CGACGTACGATACGATA CAGCGCGCGCGCATACGACT
CGTCGATACAGTCGGGG TAGTAGATAGATAGGGGGGG
etc.

to go to

HBING_DH (tab) Human bing dehydrogenase (tab) CATCAT...
MBING_AL (tab) Moose bing aldolase (tab) CGACGT...
etc.

Of course, I could be wrong and you might want tabs between all
the residues! If so, mail/post and I'll write one to do that.
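
For what it's worth, the tab-between-residues variant would only need
a small change. A minimal sketch, assuming the sequence has already
been joined into one string; the subroutine name tab_residues is made
up for illustration and is not part of the script that follows:

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: put a tab between every residue of a sequence.
sub tab_residues {
    my ( $sequence ) = @_;
    $sequence =~ s/\s+//g;    # drop spaces and line breaks first
    return join( "\t", split( //, $sequence ) );
}

# Prints the nine residues of the test string, one per column.
print tab_residues( "CATCAT GCG" ), "\n";
```

Each residue then lands in its own column, so long sequences may
exceed the column limits of some spreadsheet or database programs.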

---cut here---

#!/usr/bin/perl -w
#
# FASTApar version 0.1   k.james at bangor.ac.uk
# My 2nd Perl script
#
# Parses FASTA files into tab-delimited files.
# The output file is of the format:
#   SEQNAME tab SEQDESCRIPTION tab SEQUENCE
#

( $seq_source, $seq_dest ) = @ARGV;

$usage = "perl fastapar.pl source_file destination_file";
unless ( $seq_source && $seq_dest )
{ die "\nUsage: $usage\n" };

open( SOURCE, "<$seq_source" ) or die
"Couldn't open the source file $seq_source: $!\n";

LINE:	while( $line = <SOURCE> ){
		chomp( $line );

SWITCH: {

if ( !$reading_seq && $line =~ /^>/ ) {
		# found the start of a sequence, so strip the leading >
		# (and any space after it) and split up the header;
		# this copes with both ">NAME" and "> NAME"

		$reading_seq = 1;
		$line =~ s/^>\s*//;
		@header = split /\s+/, $line;
		last SWITCH;
		};

if ( $reading_seq && $line !~ /^>/ ) {
		# already reading a sequence and no new header on
		# this line, so remove whitespace and add the line
		# to the currently read sequence

		$line =~ s/\s+//g;
		push @sequence, $line;
		last SWITCH;
		};

if ( $reading_seq && $line =~ /^>/ ) {
		# already reading a sequence but there is a header
		# on this line, so stop reading, make an entry in
		# the output list, clear the sequence list and redo
		# that line

		$reading_seq = 0;
		@entry = (
				shift @header,
				( join ( " ", @header ) ),
				( join ( "", @sequence ) )
				);
		push @output, join ( "\t", @entry );
		undef @sequence;
		redo LINE;
		};
	}
}

	# add the last sequence to the output list when
	# we have run out of lines in the source file as
	# this is the only one whose end is not delimited
	# by the start of a new sequence (skipped if no
	# header was ever found, e.g. an empty file)

	if ( $reading_seq ) {
		@entry = (
				shift @header,
				( join ( " ", @header ) ),
				( join ( "", @sequence ) )
				);
		push @output, join ( "\t", @entry );
	}

	close ( SOURCE );


print "Found " . scalar( @output ) . " sequences in $seq_source\n";

open ( DESTINATION, ">$seq_dest" ) or die
"Couldn't create the destination file $seq_dest!\n";

select ( DESTINATION );
foreach $entry ( @output ) {
	print $entry . "\n";
}

close ( DESTINATION );

---cut here---

Copy the text into a file called fastapar.pl and launch it with
"perl fastapar.pl source_file destination_file". I've tested it
with a few FASTA files and it seemed to work OK (Perl5 for win32
on NT 4). It's a bit primitive, so feel free to suggest improvements.
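
One possible improvement, offered only as a sketch: write each record
as soon as it is complete instead of collecting everything in a list
first, so that large files need little memory. The subroutine names
fasta_record and fasta_to_tab are invented for this example, and the
in-memory filehandle used for the demonstration assumes Perl 5.8 or
later.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: build one tab-delimited record from a header
# line and the sequence lines that followed it.
sub fasta_record {
    my ( $header, @seq_lines ) = @_;
    $header =~ s/^>\s*//;                  # accept ">NAME" and "> NAME"
    my ( $name, $description ) = split /\s+/, $header, 2;
    $name        = "" unless defined $name;
    $description = "" unless defined $description;
    my $sequence = join "", @seq_lines;
    $sequence =~ s/\s+//g;                 # drop whitespace inside the sequence
    return join "\t", $name, $description, $sequence;
}

# Hypothetical helper: read FASTA from $in and write each record to
# $out as soon as the next header (or the end of the input) is reached.
# Returns the number of records written.
sub fasta_to_tab {
    my ( $in, $out ) = @_;
    my ( $header, @seq_lines );
    my $count = 0;
    while ( my $line = <$in> ) {
        chomp $line;
        if ( $line =~ /^>/ ) {
            if ( defined $header ) {
                print {$out} fasta_record( $header, @seq_lines ), "\n";
                $count++;
            }
            ( $header, @seq_lines ) = ( $line );   # start a new record
        }
        elsif ( defined $header ) {
            push @seq_lines, $line;                # lines before any header are ignored
        }
    }
    if ( defined $header ) {
        print {$out} fasta_record( $header, @seq_lines ), "\n";
        $count++;
    }
    return $count;
}

# Demonstration on a shortened, in-memory version of the example input.
my $fasta = "> HBING_DH Human bing dehydrogenase\nCATCAT\nGCGCGC\n"
          . ">MBING_AL Moose bing aldolase\nCGACGT ACGATA\n";
open( my $in, "<", \$fasta ) or die "Couldn't open in-memory input: $!\n";
my $found = fasta_to_tab( $in, \*STDOUT );
close( $in );
print STDERR "Found $found sequences\n";
```

To use it on real files, open the source and destination files and
pass those filehandles to fasta_to_tab in place of the in-memory one
and STDOUT.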

--
Keith James PhD - k.james at bangor.ac.uk    PGP 2.6.2i  Key ID 469A9FA1
Biodegradation Group                         *Encrypt and Survive*  
School of Biological Sciences             Nightmare: Quake me up now!
University of Wales, Bangor, UK                                     
-------http://oracle.bangor.ac.uk/sbs/research/biodegradation/-------




More information about the Bio-soft mailing list
