Piotr Kozbial wrote:
> I am interested in testing several ideas about organization of genomic
> information.
> Could you please send me references about:
> 1. Sequences management in relational databases.
Hi,
I think storing sequence data together with all the known biological
information in a relational SQL database must be far more efficient than
anything else...
However, smart database design seems to be a tricky thing, and bioscientists
like me like the old text-formatted files; portability is also easy with
these flat text files. Still, I decided to use an SQL database to manage all
my sequence data plus all the additional stuff connected to the sequences.
> Databases, I know, store data in tables and rows, but sequences seem to
> be stored in flat files (i.e. in FASTA format). Is it a good idea to chop
> the sequences and transfer them into a relational database? Some kinds of
> sequences are well suited for storage in a relational database (i.e.
> protein and cDNA sequences), but genomic sequences are not. Is it a good
> idea to cut genomic sequences into fragments containing ORFs with
> their upstream and downstream sequence, and with some positioning
> information (i.e. IDs of upstream and downstream ORFs)? With each ORF
> in the database it is possible to store additional information (computed
> or taken from known literature) like:
Hm, if the genome identification is complete I'd say yes, split the genome
into ORFs including all the positioning data, regulatory elements (if known),
etc. I don't know much about relational databases, but it seems to be a
(biological) problem to split the data and then connect it again when
searching the database.
Honestly - I've no idea about THE ideal solution, so I'd split the data!
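
To make the "split the data" idea a bit more concrete, here is a minimal
sketch in Python using the sqlite3 module from the standard library. The
table name `orf`, all column names, the function `store_orfs` and the
`flank` parameter are my own assumptions for illustration, not an
established schema. It also assumes the ORF coordinates are already known
and sorted by start position, and it stores everything as read on the
forward strand.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS orf (
    id                INTEGER PRIMARY KEY,
    name              TEXT NOT NULL,
    start_pos         INTEGER NOT NULL,  -- 0-based start on the genome
    end_pos           INTEGER NOT NULL,  -- end position, exclusive
    strand            TEXT NOT NULL CHECK (strand IN ('+', '-')),
    upstream_seq      TEXT NOT NULL,     -- flank before start_pos
    orf_seq           TEXT NOT NULL,
    downstream_seq    TEXT NOT NULL,     -- flank after end_pos
    upstream_orf_id   INTEGER REFERENCES orf(id),
    downstream_orf_id INTEGER REFERENCES orf(id)
);
"""


def store_orfs(conn, genome, orfs, flank=5):
    """Cut `genome` into ORF fragments with flanks and store them.

    `orfs` is a list of (name, start, end, strand) tuples sorted by start.
    "Upstream" and "downstream" here simply mean lower and higher genome
    coordinates; strand-aware handling is left out of this sketch.
    Returns the list of new row ids in genome order.
    """
    conn.executescript(SCHEMA)
    ids = []
    for name, start, end, strand in orfs:
        cur = conn.execute(
            "INSERT INTO orf (name, start_pos, end_pos, strand,"
            " upstream_seq, orf_seq, downstream_seq)"
            " VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                name, start, end, strand,
                genome[max(0, start - flank):start],
                genome[start:end],
                genome[end:end + flank],
            ),
        )
        ids.append(cur.lastrowid)
    # Second pass: link each ORF to its neighbours along the genome.
    for i, orf_id in enumerate(ids):
        up = ids[i - 1] if i > 0 else None
        down = ids[i + 1] if i + 1 < len(ids) else None
        conn.execute(
            "UPDATE orf SET upstream_orf_id = ?, downstream_orf_id = ?"
            " WHERE id = ?",
            (up, down, orf_id),
        )
    conn.commit()
    return ids
```

The neighbour IDs are what lets you "connect it again": starting from one
ORF you can walk along the genome with simple lookups by id. Additional
information (motifs, domains, interaction partners) would probably go into
separate tables that reference `orf.id`.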
> -cDNA sequence,
> -IDs of known aa motifs,
> -ID of known conserved structural domains,
> -ID of interacting proteins,
> -pre computed information about structural, sequence, and functional
> homologies (similar to "neighbors" in NCBI databases),
> -all other information (especially raw experimental data),
> 2. There are lots of tools for sequence analysis written in perl, c,
> c++, etc.
> How the interface between the database and the tools should be designed?
> Are there any examples?
I'd use some software in the middle. You may need a tool that performs the
SQL database query and writes the sequences (results) in a common format
(e.g. FASTA) to whatever program you use (Fasta, Blast ...); otherwise you
have to hack the code of the existing programs you use.
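
Such a piece of "software in the middle" could look roughly like the
following Python sketch (standard library only). The function name
`write_fasta`, its parameters and the three-column query convention
(id, description, sequence) are assumptions of mine, not part of any
existing package; the database here is SQLite just to keep the example
self-contained.

```python
import sqlite3
import sys


def write_fasta(conn, query, params=(), out=sys.stdout, width=60):
    """Run `query` and write each result row as a FASTA record to `out`.

    The query must return exactly three columns per row:
    sequence id, description, and the sequence itself.
    Returns the number of records written.
    """
    count = 0
    for seq_id, description, sequence in conn.execute(query, params):
        # Header line: '>' followed by the id and a free-text description.
        out.write(f">{seq_id} {description}\n")
        # Sequence lines, wrapped to `width` characters.
        for i in range(0, len(sequence), width):
            out.write(sequence[i:i + width] + "\n")
        count += 1
    return count
```

Pointing `out` at an open file gives you a FASTA file that the analysis
program can read as usual, so the program itself stays untouched.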
HTTP servers and databases work together via CGI scripts, and there are lots
of intermediate software packages intended to make it easier to communicate
between applications and database servers. If you're interested in MySQL and
related software, have a look at http://sunsite.icm.edu.pl/mysql/ .
Is there any documentation about integrating biological information in
relational databases, any starting points?
That's lots of text with minimal help. However, maybe we can start a
discussion,
Arne
--
Arne Mueller
Biomolecular Modelling Laboratory
Imperial Cancer Research Fund
44 Lincoln's Inn Fields
London WC2A 3PX, U.K.
phone : +44-(0)171 2693405 | fax :+44-(0)171-269-3534
email : a.mueller at icrf.icnet.uk | http://www.icnet.uk/bmm/