NFS databases

Bill Pearson wrp at dayhoff.med.Virginia.EDU
Tue Jun 20 21:19:15 EST 1995

In article <3rvoc1$kc9 at gap.cco.caltech.edu>,
 <mathog at seqaxp.bio.caltech.edu> wrote:
>Interesting, since we routinely saturate the nets whenever we FTP stuff
>between our two AXPs.
>What is your hardware configuration?  Do you have a switching hub
>or anything special?

	As far as I know, there is nothing special about our hardware
configuration.  All the machines are on the same subnet.  We are
exclusively a UNIX shop with NFS buffer sizes set to 8K (optimal if
you don't lose packets, I'm told).  We can also saturate the network
with FTP, but the question had to do with NFS-mounting sequence
databases.  In this case, there is almost always some computation done
on their contents, which is probably why saturation is less of an
issue.

>Genbank is around 150Mb of sequence data.  On our local subnet data moves
>at around 700-800 kb/sec between the AXPs via binary FTP.  Probably that's
>about the same for NFS too.  So a full search, if not compute bound, would
>take about 190 seconds, or 3 minutes, to move the whole database.  If the
>client cannot process the data at this rate, then you will load the net
>proportionally less.  

	But almost any search is compute bound.  FASTA, the program
I am most familiar with, spends less than 10% of its time reading even
a low-density library like the Genbank flat files.  With more compact
databases (blast or PIR format), less than 5% of the time is spent
reading the data.
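
A back-of-envelope sketch of what that read fraction implies for
network load, using figures from this thread (23 Mbyte gbmam.seq,
~150 sec search on dayhoff, <10% of the run spent reading); the
numbers are illustrative, not new measurements:

```python
# Estimate the network load imposed by one NFS-mounted search.
# All inputs are figures quoted in this post; 10% is the stated
# upper bound on the fraction of time spent reading.

total_s = 150.0      # wall-clock time of one search (dayhoff timing)
read_frac = 0.10     # upper bound on the fraction of time spent reading
db_mb = 23.0         # approximate size of gbmam.seq

burst_mb_s = db_mb / (total_s * read_frac)  # rate while actually reading
avg_mb_s = db_mb / total_s                  # rate averaged over the run

print(f"burst read rate : {burst_mb_s:.2f} MB/s")
print(f"average load    : {avg_mb_s:.2f} MB/s")
```

Even the burst rate (~1.5 MB/s) occurs only a tenth of the time;
averaged over the run, one client asks for roughly 0.15 MB/s, a small
fraction of the 0.7-0.8 MB/s the same Ethernet delivers to FTP.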

>So the key question is how long does each of these searches take?  Ie, what
>fraction of the time are you moving data, and what fraction crunching it? 

(see above)

>Here's a command that, if run on a fast machine,  would seem very likely to
>saturate a net if Genbank were NFS mounted: 
>$ findpatterns/infile=gb:*/pattern=AGCTAGCTAGCTAGCTACGT/default
>Since you've got this configuration already set up, perhaps you wouldn't
>mind doing the experiment?  Your choice of measures for how much capacity
>remains in the net.

In fact, I do not have control over a machine with GCG installed.  On
our systems, any job that runs over 5 minutes gets dumped in a low
priority queue, so the timings are not very helpful.  In addition, the
machines tend to be loaded 24 hr/day, so tonight, when I did the
timings, there were always two other users competing for time.

I did 4 timings using my version of FASTA (not GCG's).  The command I
used was:

/bin/time fasta -q mgstm1.seq "gbmam.seq 1" >& mgstm1.ok6 &

I ran the timings on three machines: an RS/6000 model 370 (dayhoff)
with a local disk, an RS/6000 model 390 (avery) with an NFS disk
mounted from dayhoff, and twice on an Alpha 2100 4/275 (alpha0), again
over NFS: once during the simultaneous searches on dayhoff and avery,
and once when no other searches were going on.  The load averages on
the two RS/6000s during the search were 2.5-3; the alpha was single
user.

The times were:

        dayhoff         avery    alpha0            alpha0
        (local disk)    (NFS)    (3 simultaneous   (no other
                                 searches)         search)
real:   150.9 sec       100.7    20.0              19.3
user:    60.7            38.5    18.5              18.2

Since the alpha is a dual processor, I also tried two simultaneous
searches at the same time as a search on another machine (for a total
of 3). Essentially no effect was seen.

gbmam.seq is about 23 Mbytes, so on the alpha the program was reading
and searching at >1 Mb/sec, without any degradation from simultaneous
searches elsewhere.  fgrep on the same file took 33 sec, while cat-ing
it from the NFS system to the local system took 57 sec.
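
Those rates can be recomputed from the timings above (a quick sketch;
the 23 Mbyte file size is approximate):

```python
# Effective data rates on the alpha, from the timings in this post.
# gbmam.seq is approximately 23 MB; times are in seconds.

db_mb = 23.0
timings = {
    "fasta (NFS, no other search)": 19.3,
    "fgrep (same file)":            33.0,
    "cat (NFS to local)":           57.0,
}

for name, secs in timings.items():
    print(f"{name:30s} {db_mb / secs:4.2f} MB/s")
```

fasta comes out at about 1.2 MB/s, while a raw cat over NFS manages
only about 0.4 MB/s (the cat figure also includes writing the copy to
the local system), so even a full-speed search never approaches the
wire rate.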

I also tried a fasta during an fgrep on the same alpha; this increased
the search time from 19 to 24 sec.

So, while I don't deny that Ethernets can be saturated, I think it is
difficult to saturate the net while doing anything useful with the
content of the data.  Thus, I don't think that NFS file-sharing of
databases is a problem for "typical" installations.

Bill Pearson
wrp at virginia.EDU
Dept. of Biochemistry #440
U. of Virginia
Charlottesville, VA 22908
