things have been a bit quiet on this group ... many must be having a
good time in Madison at the International Yeast meeting.
I was at the "In silico analysis of Yeast Chromosomes" meeting which
was held in Orsay, France a couple of weeks ago, and one of the many
interesting things that was talked about was the availability by
anonymous FTP of LISTA (version 2).
LISTA is a database (list or flatfile) of all known genes from
Saccharomyces with their gene name, synonyms, accession number,
length of coding sequence, codon bias, reference and other comments.
It was put together by Patrick Linder, Reinhard Doelz,
Marie-Odile Mosse, Jaga Lazowska and Piotr P. Slonimski
(see below for their respective addresses)
I enclose the lista2.doc below, with a few entries at the end,
so you have an idea of what it looks like.
The anonymous FTP address is:
(if your networked machine has trouble with that address
try this IP address 22.214.171.124)
If you do not know about FTP, I have an FTP_starter_kit that
can help you to get started,
regards to all,
A comprehensive compilation of nucleotide sequences
encoding proteins from the yeast Saccharomyces
Release 2, March 1993
Patrick Linder(1), Reinhard Doelz (2), Marie-Odile Mosse(3),
Jaga Lazowska(3) and Piotr P. Slonimski(3)
1 Dept. of Microbiology, Biozentrum, Klingelbergstr. 70, 4056 Basel, Switzerland
2 Biocomputing, Biozentrum, Klingelbergstr. 70, 4056 Basel, Switzerland
3 Centre de Genetique Moleculaire, Laboratoire propre du CNRS associe a
l'Universite Pierre et Marie Curie, F-91190 Gif sur Yvette, France
Tel +41 61 -267 21 35
FAX +41 61 - 267 21 18
Email: linder at urz.unibas.ch
This manual and the database it accompanies may be
copied and redistributed freely, without advance
permission, provided that this statement is
reproduced with each copy.
This document describes the format and conventions used in this database,
a comprehensive compilation of nucleotide sequences encoding proteins from the
yeast Saccharomyces. Efforts have been undertaken to make the collected data as
easily accessible as possible without restricting their usefulness to a
particular type of computing environment. For this reason, the simplest possible
organisation ("flat file") has been chosen. It is hoped that users with limited
computing experience or facilities will find this organisation easy to work
with, while those requiring a more complex structure for use with more advanced
tools will find reorganisation straightforward enough to be done by a computer
The continued development and improvement of the database depends to a
significant degree on user feedback. A User Report Form for this purpose is
provided at the back of this manual; we hope that you will use it if you find
errors, omissions, or something you think could be done better. This is of
particular importance for this release, as LISTA Release 3 is currently being
prepared and will have new features and additions.
We would like to stress that both this manual and the database itself are free
from any copyright restrictions (please see the statement on the title page).
While we would appreciate acknowledgement if our efforts have been useful to
you, we want to ensure that the data are freely available to anyone interested.
2 SCIENTIFIC BACKGROUND
The amount of nucleotide sequence data is increasing exponentially. We therefore
compiled this genetic database. Each sequence has been attributed a single
genetic name and, in the case of allelic duplicated sequences, synonyms are
given if necessary. Along with the genetic name, the mnemonic from the EMBL
databank, the codon bias, the reference to the publication of the sequence and
the EMBL accession numbers are included in each entry.
The database, as previously described in the literature [1,2], contains sequence
data assignments from Saccharomyces cerevisiae, Saccharomyces carlsbergensis and
Saccharomyces uvarum, which are believed to constitute conspecific taxonomic
species [3]. Sequences from the unrelated Schizosaccharomyces pombe, Candida,
Hansenula and others are not included. We also exclude sequences from
extragenomic elements like the 2-micron plasmid, mitochondrial DNA, killer
sequences and Ty elements.
3 CONTENTS OF THE DATABASE
The database includes at present a gene name, a synonym in case the same
sequence has been published more than once under different names, the mnemonic,
the length of the coding sequence without the stop codon, the codon bias
according to [4], the reference of the first publication of the sequence, the
accession number and, if necessary, a commentary. Other items such as the
chromosomal localization, description of the gene product, cross-references to
other databases and adjacent genes will be included in the future.
Genetic nomenclature relies on the glossary compiled by Mortimer et al. [5] and
was used wherever possible. In many cases, however, no or incorrect gene
designations have been given to published sequences. Moreover, the same name was
given to different sequences or different names have been given to the same
sequence. To sort out this problem of nomenclature, a priority rule for naming
genes in the present database was established. According to this rule, the name
of the first published sequence (by date of acceptance of the publication) is
used in the list, provided it is in accordance with the standard genetic
nomenclature. Other names are included as synonyms. In some cases four-letter
designations (ARGR1, MRPL20) or gene names followed by a letter (RPL4A, TIF51A)
have also been used. In the case of historically well-established gene
designations such as HO, it was self-evident that they should be retained.
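
The priority rule can be illustrated with a short Python sketch. It is not
part of LISTA: the function name apply_priority_rule is made up for this
example, the alphabetical tie-break is an assumption, and the check against
the standard genetic nomenclature is not modelled. The two dates are the
acceptance dates quoted in the comment lines of the first sample entry at the
end of this post.

```python
from datetime import date


def apply_priority_rule(candidates):
    """Pick the gene name whose sequence publication was accepted first.

    `candidates` maps each published name to the acceptance date of its
    publication. Returns (gene_name, synonyms). The check against standard
    genetic nomenclature described in the text is NOT modelled here.
    """
    if not candidates:
        raise ValueError("at least one candidate name is required")
    # Earliest acceptance date wins; ties are broken alphabetically
    # (an assumption made for this sketch only).
    ordered = sorted(candidates, key=lambda name: (candidates[name], name))
    return ordered[0], ordered[1:]


# Acceptance dates as quoted in the CC lines of the AAA1 / NAT1 entry.
name, synonyms = apply_priority_rule({
    "NAT1": date(1989, 2, 23),
    "AAA1": date(1989, 1, 23),
})
print(name, synonyms)  # AAA1 ['NAT1']
```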
Sequences of open reading frames which occur more than once may represent
allelic sequences originating from the same gene or non-allelic sequences from
duplicated genes. This database distinguishes between these two cases by
comparing the 5' and 3' non-coding sequences, which in general diverge
considerably in non-allelic duplicated genes but are highly similar or identical
in allelic sequences. Exceptions have been discussed. In both cases, the
results of the comparisons are included in the comment lines.
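
As a rough illustration of the flanking-sequence comparison, here is a minimal
Python sketch. It assumes the 5' and 3' flanking regions have already been
aligned to equal length. The function names and the 90% cut-off are
placeholders invented for this example. The manual does not state the
threshold or the alignment method actually used.

```python
def percent_identity(a, b):
    """Percentage of matching positions between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty and equal in length")
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / len(a)


def classify_pair(flank5_a, flank5_b, flank3_a, flank3_b, threshold=90.0):
    """Label two copies of an ORF as 'allelic' or 'non-allelic duplicate'.

    Both flanking regions must reach `threshold` percent identity to be
    called allelic. The default of 90.0 is a placeholder for this sketch.
    """
    id5 = percent_identity(flank5_a, flank5_b)
    id3 = percent_identity(flank3_a, flank3_b)
    label = "allelic" if min(id5, id3) >= threshold else "non-allelic duplicate"
    return label, id5, id3


# Toy flanking sequences, made up for the example.
print(classify_pair("ACGTACGT", "ACGTACGT", "TTGACCAA", "TTGACCAA"))
print(classify_pair("ACGTACGT", "TGCATGCA", "TTGACCAA", "TTGACCAA"))
```

Requiring both flanks to pass is one possible reading of the rule. A real
comparison would also have to deal with gaps and with flanks of unequal length.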
5 FORMAT OF THE DATABASE
Each entry in the database is composed of lines. Different types of lines, each
with its own format, are used to record the various types of data which make up
the entry. Note that each line begins with a two-character line code, which
indicates the type of information contained in the line. The currently used line
types, along with their respective line codes, are listed below. This
arrangement of the database allows easy integration with other databanks. For
example, links between the LISTA database and the EMBL sequence database were
accomplished using the Sequence Retrieval System program [6].
5.1 Gene names and synonyms GN and SY fields
For the nomenclature, a standard principle for naming gene sequences based on
priority rules was used. A simple method to distinguish duplicated sequences of
one and the same gene from non-allelic sequences of duplicated genes was em-
5.2 Sequence Data References DR fields
The nucleotide sequence data are generally quoted in the database as they have
been published in the EMBL database, subject to some conventions which have
been adopted by this database provider.
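
Judging from the sample entries at the end of this post, a DR line carries
three semicolon-separated fields closed by a full stop. A minimal Python
sketch for splitting such a line is given below. The function name
parse_dr_line and the dictionary keys are choices made for this example, and
the field layout is inferred from the samples, not taken from a formal
specification.

```python
def parse_dr_line(line):
    """Split a DR line such as 'DR EMBL; M23166; SCNACT.' into its fields."""
    code, _, rest = line.strip().partition(" ")
    if code != "DR":
        raise ValueError("not a DR line: %r" % line)
    # Drop the closing full stop, then split on semicolons.
    fields = [field.strip() for field in rest.strip().rstrip(".").split(";")]
    if len(fields) != 3:
        raise ValueError("expected 3 fields in DR line: %r" % line)
    database, accession, mnemonic = fields
    return {"database": database, "accession": accession, "mnemonic": mnemonic}


print(parse_dr_line("DR EMBL; M23166; SCNACT."))
```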
5.3 Literature References RL fields
The references cited for an entry should be considered a pointer to the
literature and not as assigning scientific credit for the elucidation of the
sequence. Although every effort is made to give complete reference information,
occasionally only a secondary source has been cited. This has happened most
frequently in cases where a secondary reference has presented the data in a form
5.4 Codon Bias CB field
The codon bias was computed as described in [4].
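
For readers unfamiliar with the measure, the sketch below computes a codon
bias index in the form in which it is commonly quoted,
(N_pref - N_rand) / (N_tot - N_rand). Treat it as an illustration only: the
authoritative definition and the list of preferred yeast codons are in
reference 4, and whether LISTA applies exactly this form is an assumption.
The function name and the two-family codon table are toy choices made for
this example.

```python
def codon_bias_index(cds, families):
    """Return (N_pref - N_rand) / (N_tot - N_rand) for a coding sequence.

    `families` maps an amino acid to a pair of sets:
    (preferred codons, all synonymous codons for that amino acid).
    Codons outside the listed families are ignored.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    n_pref = n_tot = 0
    n_rand = 0.0
    for preferred, synonymous in families.values():
        used = sum(1 for codon in codons if codon in synonymous)
        n_tot += used
        n_pref += sum(1 for codon in codons if codon in preferred)
        # Expected preferred-codon count if synonymous codons were used equally.
        n_rand += used * len(preferred) / len(synonymous)
    if n_tot == n_rand:
        raise ValueError("index undefined: no informative codons in sequence")
    return (n_pref - n_rand) / (n_tot - n_rand)


# Toy table with two amino acids only; NOT the full preferred-codon set.
toy_families = {
    "Lys": ({"AAG"}, {"AAA", "AAG"}),
    "Phe": ({"TTC"}, {"TTT", "TTC"}),
}
print(codon_bias_index("AAGAAGAAATTC", toy_families))  # 0.5
```

In the toy sequence three of the four codons are preferred and two would be
expected by chance, giving (3 - 2) / (4 - 2) = 0.5.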
5.5 Length LN field
The length of the sequence given in the DR field is quoted.
Number of fields          Key   Description
------------------------  ---   ----------------------------------
(begins each entry)       GN    gene name
0 or more                 SY    synonym
1 or more per GN or SY    DR    EMBL accession number and mnemonic
1 per DR                  LN    length of sequence
1 per DR                  CB    codon bias
1 per DR                  RL    literature reference
0 or more                 CC    additional comments
1 per entry               //    end of entry
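
The line types listed above lend themselves to a very small reader. The
Python sketch below splits a flat file into entries at each "//" line and
returns every line as a (code, value) pair. The function name parse_lista is
made up for this example, and the exact spacing after the line code is an
assumption. The GN and SY lines in the sample are reconstructed from the
comment lines of the first sample entry, and LN and CB lines are left out
because their values are not quoted in this post.

```python
def parse_lista(text):
    """Split LISTA-style flat-file text into entries.

    Each entry is a list of (line_code, value) tuples in file order.
    An entry ends at a line starting with '//'.
    """
    entries = []
    current = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith("//"):
            if current:
                entries.append(current)
            current = []
            continue
        # The first two characters are the line code; the rest is the value.
        current.append((line[:2], line[2:].strip()))
    if current:
        entries.append(current)  # tolerate a missing final terminator
    return entries


sample = """\
GN AAA1
SY NAT1
DR EMBL; M23166; SCNACT.
RL J. BIOL. CHEM. 264:12339-12343(1989).
DR EMBL; X15135; SCNAT.
RL EMBO J. 8:2067-2075(1989).
//
"""

for entry in parse_lista(sample):
    accessions = [value for code, value in entry if code == "DR"]
    print(entry[0], accessions)
```

Keeping the lines in file order preserves the pairing of each DR line with
the LN, CB and RL lines that follow it.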
6 STATUS OF THE DATABASE
The LISTA database is available either on diskettes (M.-O. Mosse, Centre de
Genetique Moleculaire, CNRS, F-91190 Gif sur Yvette; mosse at frcgm51.bitnet) or
by anonymous FTP from bioftp.unibas.ch [126.96.36.199] on the Internet. New
sequences and comments on the existing database may be sent to P. Linder
(linder at urz.unibas.ch). Release 2 is considered preliminary and is about to
be extended. Further releases are therefore still open to feedback and
suggestions, which may be sent to the same address.
As Release 2 is of a preliminary nature, references in the comment lines (CC)
refer to the publication of the LISTA2 database in [2]. It is anticipated that
these dependencies will be replaced in a future release, and that some of the
syntax in the comment lines will be refined.
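
For those who prefer a script to an interactive FTP session, an anonymous
login can be sketched in Python with the standard ftplib module. Only the
host name comes from this manual. The function name is made up, and since the
directory and file names of the database are not given here, the sketch only
lists the top-level directory. The server may not answer at this address.

```python
from ftplib import FTP


def list_lista_server(host="bioftp.unibas.ch", timeout=30):
    """Log in anonymously and return the names in the top-level directory."""
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # calling login() without arguments logs in anonymously
        return ftp.nlst()


# Example call (needs network access, so it is left commented out):
# print(list_lista_server())
```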
This work was supported by grants from the Ministere de l'Education Nationale,
the Ligue Nationale contre le Cancer and the E.E.C. (to P.S.) and by the Swiss
National Science Foundation and Kanton Basel-Stadt (to P.L. and R.D.).
1. Mosse, M.O., Brouillet, S., Risler, J.L., Lazowska, J. and Slonimski,
P.P. (1988) Curr. Genet. 14, 529-535.
2. Mosse, M.-O., Linder, P., Lazowska, J. and Slonimski, P.P. (1993)
Curr. Genet. 23, 66-91.
3. Barnett, J.A., Payne, R.W. and Yarrow, D. (1983) (Cambridge
University Press, Cambridge) 811.
4. Bennetzen, J.L. and Hall, B.D. (1982) J. Biol. Chem. 257, 3026-3031.
5. Mortimer, R.K., Contopoulou, C.R. and King, J.S. (1992) Yeast 8, 817-
6. Etzold, T. and Argos, P. (1993) CABIOS 9, 49-57.
USER REPORT FORM
| LISTA database |
| User feedback will help us to | Return to: |
| improve the quality of the service | Patrick Linder |
| we provide. Please use this form | Klingelbergstrasse 70 |
| to report errors, omissions, | CH 4056 BASEL , Switzerland |
| suggestions or other comments to us. | linder at urz.unibas.ch |
| Name: | Address: |
| Telephone: | |
| Date: | |
| Type of report: [ ]error [ ]problem [ ]suggestion [ ]comment |
| Release of database to which this report applies: |
| Entry or entries affected: |
| Report (please be as precise as possible - attach listings if necessary): |
| Continue on further sheets (or Xerox form) if necessary |
the first 3 LISTA2 entries ...
DR EMBL; M23166; SCNACT.
RL J. BIOL. CHEM. 264:12339-12343(1989).
DR EMBL; X15135; SCNAT.
RL EMBO J. 8:2067-2075(1989).
CC The name is AAA1 (scnact, accepted 23.1.89), the
CC synonym NAT1 (scnat, accepted 23.2.89). Both names
CC are present in the list of (Mortimer et al. 1989). The
CC sequences are identical.
DR EMBL; M12514; SCPET9.
RL MOL. CELL. BIOL. 6:626-634(1986).
DR EMBL; M64706; SCBUB2Q-1.
RL CELL 66:507-517(1991).
CC The sequences are identical. The reading frame of AAC1
CC in scsub2q is partial.
DR EMBL; J04021; SCAAC2.
RL J. BIOL. CHEM. 263:14812-14818(1988).
DR EMBL; M34075; SCAAC3.
RL J. BIOL. CHEM. 265:12711-12716(1990).
CC The sequences are 94.86% (n) and 99.06% (p) identical.
CC The identity in the flanking sequences is 100% and
CC 97.17% for the 5' and 3' regions, respectively.
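
The RL lines in the entries above all follow the pattern
"journal volume:first-last(year).". A Python sketch that picks such a line
apart is given below. The pattern is inferred from these few samples only,
and references in the full database may well deviate from it. The function
name parse_rl_line and the dictionary keys are choices made for this example.

```python
import re

# Pattern inferred from the sample RL lines; not an official format definition.
RL_PATTERN = re.compile(
    r"^RL\s+(?P<journal>.+?)\s+(?P<volume>\d+):"
    r"(?P<first>\d+)-(?P<last>\d+)\((?P<year>\d{4})\)\.?$"
)


def parse_rl_line(line):
    """Return journal, volume, page range and year from an RL line."""
    match = RL_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unrecognised RL line: %r" % line)
    return {
        "journal": match.group("journal"),
        "volume": int(match.group("volume")),
        "pages": (int(match.group("first")), int(match.group("last"))),
        "year": int(match.group("year")),
    }


print(parse_rl_line("RL J. BIOL. CHEM. 264:12339-12343(1989)."))
```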
B.F. Francis Ouellette (old address: francis at monod.biol.mcgill.ca)
new temporary address: francis at ego.psych.mcgill.ca