Webservers with Terabytes of Data in - recommended setups

Previous thread: Re: Static Ip's: Routing and Fowarding by Bryan Vyhmeister on Tuesday, April 17, 2007 - 8:09 pm. (1 message)

Next thread: Back again with funny network interfaces by Manuel Ravasio on Wednesday, April 18, 2007 - 6:55 am. (11 messages)
From: Siju George
Date: Wednesday, April 18, 2007 - 2:52 am

Hi,

How do you handle it when you have to serve terabytes of data through
http/https/ftp etc.?
Put it on different machines and use some kind of
load balancer/intelligent program that directs to the right machine?

Use some kind of clustering software?

What hardware do you use to make your system scalable from a few
terabytes of data to a few hundred of them?

Does OpenBSD have any clustering software available?

Is anyone running such setups?
Please let me know :-)

Thank you so much

Kind Regards

Siju

From: Jacob Yocom-Piatt
Date: Wednesday, April 18, 2007 - 5:02 am

can't say with complete confidence b/c i've never done it but using NFS
or AFS would be a start.

AFS would likely be the best solution, albeit with a much sharper
learning curve, and it can be spread over several machines. NFS would
need some system for tracking where which chunk of storage was (a PITA,
AFAICT). if there is an elegant way to achieve this with NFS i would
like to hear about it.

cheers,

From: Joachim Schipper
Date: Thursday, April 19, 2007 - 2:21 pm

I don't really know, but how about some http proxy (hoststated comes to
mind, pound or squid also works) and a lot of hosts each serving a
subset of the total behind that? Yes, that's exactly what you said.

I don't think NFS/AFS is that good an idea; you'll need very beefy
fileservers and a fast network. Maybe rsync'ing from a central
fileserver would work?

However, there are a lot of specialized solutions available (various
SANs come to mind; Google has published several papers on filesystems
and algorithms like MapReduce, although the latter isn't going to help
you for serving HTTP).

All in all, though, I think the most important factors are rate of change
and reliability requirements. A big web host might hit an impressive
amount of data, but it doesn't change all that often and a site
occasionally going offline is usually tolerated (just restore a recent
backup). In such cases, something like the above seems to work.

		Joachim

-- 
TFMotD: moduli (5) - system moduli file

From: Stuart Henderson
Date: Thursday, April 19, 2007 - 2:51 pm

NFS may actually be useful; if you really need the files in one
directory space for management/updates that's a way to do it (i.e.
mount all the various storage servers by NFS on a management
station/ftp server/whatever).

For serving content some HTTP-based scheme to get the requests to hit
the right server is probably in order. Proxies are useful if you have
special requirements (for example SSL, where it doesn't make sense to
have the CPU and the disk in the same place), but it normally makes
more sense to distribute the requests to the correct server/s in the
first place (either by front-ends that know the location of content
sending a Location: header if you want to give out URLs with a single
server name) or by the html pointing clients to the files on the

TFMotD: fsck(8) (-: Relying on black-box vendors for fixes is an
additional bonus. Works for some people, though. Allegedly.
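
A minimal sketch of the "front-ends that know the location of content
sending a Location: header" idea, assuming content can be located by the
first component of the URL path. The host names and the CONTENT_MAP table
are hypothetical placeholders, not anything named in the thread.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical map from the first path component to the storage server
# that holds it; the host names are placeholders, not from the thread.
CONTENT_MAP = {
    "isos": "http://store1.example.org",
    "video": "http://store2.example.org",
}


def locate(path):
    """Return the full URL on the server holding `path`, or None."""
    relative = path.lstrip("/")
    first = relative.split("/", 1)[0]
    base = CONTENT_MAP.get(first)
    if base is None:
        return None
    return base + "/" + relative


class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        target = locate(self.path)
        if target is None:
            self.send_error(404, "no server holds this path")
            return
        # 302 with a Location: header; the client then fetches the file
        # directly from the storage server, not through this front-end.
        self.send_response(302)
        self.send_header("Location", target)
        self.end_headers()
```

To try it, serve the handler with
http.server.HTTPServer(("127.0.0.1", 8080), Redirector).serve_forever().
The front-end only answers with redirects, so it never touches the bulk
data itself.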

From: Daniel Ouellet
Date: Thursday, April 19, 2007 - 3:08 pm

Good idea, yes, but if I recall correctly, unless major changes have been
made, doesn't the use of NFS become a huge bottleneck compared to a local
drive? I think the archive is full of complaints about the throughput of
NFS not being so good.

Am I wrong here? I would love to use NFS as well for multiple servers
accessing one source, but so far it has never worked very well for that.

If that's wrong, please correct me, as I would love to know whether that is
still the case or not.

Best,

Daniel

From: Stuart Henderson
Date: Thursday, April 19, 2007 - 3:44 pm

I meant using it the other way round: have the *webservers* export
their filesystem, and ftp/management servers mount them to provide a
single space for carrying out updates and backups, locating files,
etc.

Having a bunch of webservers serve data from a large NFS store seems
less attractive for most of the cases I can think of.

The main one I see where it may be attractive is where heavy CGI
processing or similar is done (that's usually a different situation
to having many TB of data, though). In the CGI case, there are some
benefits to distributing files another way (notably avoiding the
NFS server as a point of failure). rsync, as Joachim mentioned, is
one way to shift the files around. CVS is also suitable; it
encourages keeping tighter control over changes too, and isn't
difficult to learn.

From: Steven Harms
Date: Thursday, April 19, 2007 - 3:53 pm

This isn't an OpenBSD-specific solution, but you should be able to use an
EMC SAN to accomplish this (we use a Fibre Channel setup).


From: Bob Beck
Date: Friday, April 20, 2007 - 10:04 am

Bullshit. just use NFS :) 

	-Bob



-- 
#!/usr/bin/perl
if ((not 0 && not 1) !=  (! 0 && ! 1)) {
   print "Larry and Tom must smoke some really primo stuff...\n"; 
}

From: Joachim Schipper
Date: Thursday, April 19, 2007 - 3:23 pm

Something like that might be a very good idea, yes. Just don't try to
serve everything directly off NFS.

(An even better idea might be setting up a repository for your favourite
version control system and making partial checkouts. Gets you most of
the benefit of a unified filesystem, at the cost of complex - and thus
fragile - checkin hooks. On the other hand, version control is likely to

I think doing that in HTML will quickly become an administration

Yeah, they seem to work. It wouldn't be my first choice, either, but
I've never tried to run OpenBSD in this kind of environment. At least a
good, expen$$$ive SAN is good for covering your backside.

		Joachim

-- 
TFMotD: perl561delta (1) - what's new for perl v5.6.x

From: Henning Brauer
Date: Friday, April 20, 2007 - 3:36 am

there is nothing wrong with serving directly from NFS.

-- 
Henning Brauer, hb@bsws.de, henning@openbsd.org
BS Web Services, http://bsws.de
Full-Service ISP - Secure Hosting, Mail and DNS Services
Dedicated Servers, Rootservers, Application Hosting - Hamburg & Amsterdam

From: Joachim Schipper
Date: Friday, April 20, 2007 - 5:42 am

Really? You have a lot more experience in this area, so I will defer to
you if you are sure, but it seems to me that in the sort of system I
explicitly assumed (something like a web farm), serving everything off
NFS would involve either very expensive hardware or be rather slow.

I see how in your example - a lot of storage, not accessed often - just
serving everything off NFS makes perfect sense. However, that was not
what I was talking about.

Perhaps you could elaborate a little? I'm interested, at least...

		Joachim

-- 
TFMotD: hostapd.conf (5) - configuration file for the Host Access Point
daemon

From: Jacob Yocom-Piatt
Date: Friday, April 20, 2007 - 7:03 am

at HPC facilities (LANL, sandia, LLNL, argonne, etc) NFS is used 
extensively for this purpose since the amount of storage required for 
simulation outputs greatly outstrips the storage that any one machine 
can provide, especially the compute nodes. before i switched my email 
address i would get regular notifications that NFS filesystems were down 
for this-or-that many hours at compute facility X. from my observations,
redundancy is the biggest problem with NFS, and its ability to
efficiently serve up data is more than ample.

AFS provides additional redundancy via volume replication and having the 
various services that comprise it spread over several machines. there is 
a lot of documentation to go through tho.

cheers,

From: Will Maier
Date: Friday, April 20, 2007 - 7:25 am

Redundancy is certainly a problem, but lots of US HPC and
distributed computing sites have severe scaling problems with NFS.
High r/w traffic has killed several file servers in projects that we
work with, and it sucks big time. I don't know anyone who's happy or
excited or confident in their HPC NFS deployments; everyone I've
talked to hopes for a real solution to this problem. ;)

If the OP's use case involves lots of writes (especially from many
clients), I'd be concerned about NFS' ability to keep up. Then
again, I've had problems with pretty much all of the network
filesystems (including AFS, though it's the least bad in my
experience).

I'm still waiting for Ceph[0] to mature (and to shed its linuxisms).
;)

[0] http://ceph.sf.net/

-- 

o--------------------------{ Will Maier }--------------------------o
| web:.......http://www.lfod.us/ | email.........willmaier@ml1.net |
*------------------[ BSD Unix: Live Free or Die ]------------------*

From: Henning Brauer
Date: Friday, April 20, 2007 - 10:56 am

no. cache works. reads are no problem whatsoever in this kind of setup
(well. I am sure you can make that a problem with many frontend servers 
and lots to read. obviously. but for any sane number of frontends, 
should not)

-- 
Henning Brauer, hb@bsws.de, henning@openbsd.org
BS Web Services, http://bsws.de
Full-Service ISP - Secure Hosting, Mail and DNS Services
Dedicated Servers, Rootservers, Application Hosting - Hamburg & Amsterdam

From: Joachim Schipper
Date: Friday, April 20, 2007 - 11:29 am

Yeah, you are right. Now what was I thinking, anyway?

Anyway, thanks!

		Joachim

-- 
TFMotD: pci_make_tag, pci_decompose_tag, pci_conf_read, pci_conf_write
(9) - PCI config space manipulation functions

From: Daniel Ouellet
Date: Friday, April 20, 2007 - 11:53 am

OK, then how well does CARP work with NFS for a backup mount in case
something goes wrong with the main NFS server? Is it efficient, and is it
possible for the mount to re-establish itself? Is there a delay? What do
you consider a sane number of front ends: 10, fewer, more? By cache, do you
mean cache on the NFS server or cache on the NFS client? Sorry, it looks
like I have more questions than answers, as I gave up on NFS a few years
ago because of the bottleneck on NFS transfers. Writes were bad; reads were
OK, but not great. It may well be different now. I would be happy with
decent reads, but what can be expected? The archive is not too nice on the
subject, I have to say. It always looks like a bottleneck on the NFS side.
For a small site or low traffic, yes, that's great, but when can one expect
to reach the limits here? Any ideas?

Maybe it's time for me to revisit this yet again, but I have never been
very successful with it under high traffic.

Many thanks

Daniel

From: Clint Pachl
Date: Friday, April 20, 2007 - 11:04 pm

Well, I think that depends on too many variables. I have a movie server 
(OBSD) that exports NFS to two home theatre computers (FBSD). The movie 
server is a dual P3 1GHz with 4 U320 SCSI disks in RAID0. When 
simultaneously playing different DVDs on the two theatre computers, the 
movie server is >90% idle; that's with a TCP connection. When using UDP 
mounts it's >96% idle. Although movie files are large sequential data, 

I dump DVDs to VOB format using mplayer, so the files are 4-8GB. This 
eliminates caching in my situation. So if you had many front-ends 
accessing similar files and caching was taking place, you'd experience 
greater efficiency. I believe NFS does do some caching server-side, like 
directory lookups, etc.

Also, when I rip a DVD, it goes straight to the NFS mount. The 
bottleneck here is my DVD players, which can only read at ~2MB/s. Again, 


I disagree. Too many people try running cheap IDE disks in server 
environments and then wonder why they have poor performance. They blame 
the software. Get SCSI; it is made for highly random access, which is 

Who knows, just try an experiment. From my experience, the bottlenecks 
seem to be the local file system (UFS & disk system) of the exporting 
machine if there are many clients. Otherwise, it is network bandwidth. NFS seems 
really light on top of UFS, especially when using UDP. BTW, UDP mounts 
are very robust when the clients and server are on the same Ethernet 

All I can say is that I love NFS. You're missing out. Plus it is so 
simple. I have wanted to check out AFS for fail-over reasons, but too 
many docs for me to read.

One last note. Holland's disk structuring is very cool (read his earlier 
post for details). If I were to serve NFS to dozens or hundreds of 
clients I would use his scheme, but apply his partitioning 
at the host level. If an NFS server is saturated, spread the load by 
adding another server. The drawback is that each client has multiple NFS 
mounts. However, if you have this ...
From: Craig Skinner
Date: Saturday, April 21, 2007 - 1:58 am

I don't have the experience that others here have, but a small ISP
that I worked for used NFS to serve http. It was a Linux shop; they had
a NetApp exporting 5000 users' /home dirs over NFS to a dozen 1U cheapo
i386 whiteboxes that ran apache/tomcat/cgi etc. Disk and CPU (for cgi,
https, tomcat, php, etc) were separated. The only problem that they had
with NFS was flock when mbox was used for mail storage for the mail farm
(same NetApp). When courier maildir was used, this was no longer an
issue. The web farm was mainly read only, while the mail farm was split
read and write, to the same NetApp. All eggs were in the one NetApp
basket.......


Now I work for Sun, and they have something like 30,000 employees.
Nearly all staff use Sunray work stations, and home directories are NFS
mounts over a global WAN. There is not one massive /home box, obviously.
There are many home NFS servers, in each of many cities. From here in
Scotland, I can work with an engineer elsewhere by cd'ing to
/somwhere/holland, /nowwhere/japan, /elsewhere/colorado. Only takes a
couple of seconds for the automounter to kick in.

The output of "mount" shows the layout of /home something like:

/home/user1 box1.uk:/export/home5/28/user1
/home/user2 box9.au:/export/home17/2/user2

So, many average sized boxes are used, that in turn have many average
disk packs, that are split.

As is expected, LDAP and NIS are used.
-- 
Craig Skinner | http://www.kepax.co.uk | aye-right@kepax.co.uk

From: Nick Holland
Date: Thursday, April 19, 2007 - 5:53 pm

Too open-ended a question...
Are you talking about many TB on one site?  Lots of sites?
Is there some reason it has to be on one server or one site?
Is this "huge storage, huge demand"?  Huge storage, low demand?
Is this storage all needed on day 1, or will it grow with time?
  (hint: if it grows with time, build for NOW, with ability to
add later, don't buy storage in advance!)
etc.

Let the answers to those questions guide your engineering work,
don't rely on knee-jerk reactions.  And don't be afraid to
change the question to meet available answers. :)  Common
error is to take the given proposed solution (posed as a problem,
but often someone has digested the REAL problem into what they
think is the only possible model, and sent you down a bad alley)
as gospel, and never question the basic assumptions.

I've got a web server with over 3.5TB of storage on it that cost
about $6000US a year or so ago.  It's a huge-storage, low-demand
app, probably gets on average a query a day, if that.  If the
box breaks, time can be spent repairing it, but we don't want to
lose the data (it's carefully backed up, but the backup media
is so compressed, it takes longer to uncompress the files than
it does to scp them back into the box!).  So, the thing has
redundancy where it counts (disk) and simplicity where it
doesn't matter, and it can be upgraded, enhanced and changed
as needed.  And, we have a small enough amount invested in the
thing that we can completely change our mind about the approach
to the problem any time in the future and throw it all away with
a very clear conscience.  (My current boss-of-the-week thinks he
wants to replace this with an unknown proprietary app feeding a
$30,000 per-processor database server attached to a $60,000 disk
array, so you can see how insignificant the price tag on this
system is.  You can also see something about my boss.  And why
I'm looking for a better job).

Let's say you have one website that you are trying to serve
massive amounts of ...
From: Jason Beaudoin
Date: Friday, April 20, 2007 - 8:19 am

What is the reasoning behind this?

Thanks for the awesome post!


regards,

~Jason

From: Tony Abernethy
Date: Friday, April 20, 2007 - 8:32 am

I think it runs something like this
If there is a problem somewhere on the disk,
if it's all one big partition, you must fix the big partition
if it's lots of small partitions, you fix the one with the problem.

Even worse, in some situations, 
the difference is between being dead and being somewhat crippled.

Methinks there's lots of hard-won experience behind Nick's answers ;)

From: J.C. Roberts
Date: Friday, April 20, 2007 - 2:13 pm

Your last assumption is the most correct, and Nick has put some of that 
experience into FAQ-14 for our reading pleasure.

In general, you always want to assume a failure *WILL* occur, rather 
than think in terms of "if" something will fail. Having lots of small 
partitions, and using Read Only partitions wherever possible (also 
mentioned by Nick) gives you a number of important advantages. 

Assume that someone, possibly you, has managed to trip over the power 
cord, how long will it take you to get the server back up?

If your partitions are Read/Write, then you will be doing a fsck on each 
of them. That means time.

If your partitions are huge, then you will need a lot of RAM and time to 
perform the fsck. If you have a massive partition and insufficient RAM, 
then your fsck will fail (see FAQ-14.7 "fsck(8) time and memory 
requirements") and you'll be stuck like a turtle on its back at a soup 
competition.

The above is just your start up time after a crash or power loss.

Assume that someone, possibly you, has written some bad code that will 
scribble all over the data in one of your partitions. How long will it 
take you to recover?

If the partition was marked RO, then you don't have a problem. If it was 
a small RW partition, you can repair it reasonably quickly from backup. 
If your backup media fails, your losses are minimal. By comparison, if 
it's a huge RW partition, then you're stuffed.

The list of reasons goes on and on but when you really think about it, 
you'll understand that you're just doing proper "risk management" by 
trying to mitigate as many of the bad effects of failures as possible.

Never drink the marketing kool-aid that will try to sell you on the idea 
that failures are somehow avoidable. Sure, it might sound like a nice 
idea but the idea always falls short of reality. Being prepared for the 
reality of failures is a much better approach than sticking your head 
in the sand.

/jcr

From: Nick Holland
Date: Friday, April 20, 2007 - 8:52 pm

yeah, though fortunately most of it was in the form of confirmation of

In addition to Tony and J.C.'s comments (I've edited them out for size,
go back and read 'em if you haven't), let me add another really big
reason: Growth and scalability.

Usual logic goes something like this: "I need a lot of space, so I'm
going to build a file system that has a lot of space in it", and you
drop all that space into one file system.  Efficient?  For a while,
yes.  BUT, what about when it fills up?

Usual response: "use a Volume Manager" or "Dump the data to a new,
bigger disk system".  Ok, the ability of some "volume managers" to
dynamically increase the size of a file system is kinda cool, but I
would argue that for many apps, it is just another way of saying,
"The initial design SUCKED and I had more money than brains to fix
the problem" (assuming one of the commercial products, of course).
Somewhat of an oversimplification, of course...but...

Dumping the data from one disk to another is fine and dandy when you
are talking about your 40G disk on your home or desktop computer;
the fact that you are down for a few hours is no big deal.  But what
about a server?  I don't care how fast your disks are, moving 300G of
data to a new disk system is a lot of slow work.

Here's a better idea: break your data into more manageable chunks,
and design the system to fill those chunks AND make it easy to add
more later.  So, you implement today with 1TB of data space, broken
up into two 500G chunks.  Fill the first one, move on to the second
one.  Fill the second one, you bolt on more storage -- a process
which will probably take minutes, not hours.  When you bolt on more
storage, you will be doing it in the future, when capacity is bigger
and cost is less.
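
A minimal sketch of the "fill the first chunk, then move on to the next"
policy, assuming each chunk is a separately mounted file system. The mount
point names used in the usage note are hypothetical; free space is read
with shutil.disk_usage() unless a replacement lookup is passed in.

```python
import shutil


def pick_chunk(mounts, needed_bytes, free_bytes=None):
    """Return the first mount point with room for `needed_bytes`.

    `mounts` is ordered: earlier chunks are filled before later ones.
    `free_bytes` is a callable returning the free space of a mount point;
    it defaults to asking the real file system via shutil.disk_usage().
    """
    if free_bytes is None:
        free_bytes = lambda mount: shutil.disk_usage(mount).free
    for mount in mounts:
        if free_bytes(mount) >= needed_bytes:
            return mount
    return None  # every chunk is full: time to bolt on more storage
```

For example, pick_chunk(["/archive/0", "/archive/1"], 10 * 2**20) returns
the first of the two chunks that still has 10 MB free. Adding storage
later then only means appending another mount point to the list.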

Let's look at the machine I mentioned yesterday, our e-mail archive
system:

disks:
Filesystem  1K-blocks      Used     Avail Capacity  Mounted on
/dev/wd0a      199358     49742    139650    26%    /
/dev/wd0e     1030550         6    979018  ...
From: J.C. Roberts
Date: Saturday, April 21, 2007 - 12:08 am

As Nick tried to point out, buying all of the storage you'll eventually
need for a project "right now" generally proves to be a waste of money
over the long haul. Adding storage incrementally is normally cheaper.

There is one, single, very sad exception to the above rule:
	Poorly Configured Hardware RAID sets.

I've seen this problem a lot. Somebody decides they want to put "X"
disks in a RAID-5 array, so they buy a hardware controller and the "X"
disks, then configure the whole thing according to the controller
directions... -Dumb Move.

Since you followed along with your controller documentation to set up
your RAID-5 across all the disks, you probably didn't notice that you
used all of the space on each of the drives -- or you didn't think
about the ramifications of using all of the space.

Guess what? -If one of those disks fails, you may be stuffed. The reason
is simply that you may not be able to get another disk of the exact
same size -- EVEN IF IT HAS AN IDENTICAL PART NUMBER.

Most people do not realize there are subtle differences between disks,
yes, even between disks with the exact same part number. When
manufactured, one batch of drives might allow for slightly more or less
space than the next batch of drives manufactured. As long as the drives
will hold at least "N" number of bits and they satisfy the marketing or
packaging claims (i.e. "500 Gigabytes" or whatever), they get sold.

So all of your original disks looked identical and allowed you to use
the same exact size for the total space on each. And since you followed
the directions, you used all of it. The trouble is finding a correct
replacement when one of the originals fails.
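
One common way to guard against this is to allocate slightly less than the
smallest disk on every member. That remedy is an assumption here, since the
post is cut off before it reaches its own recommendation. A minimal sketch;
the 100 MB slack and the sizes in the usage note are hypothetical figures:

```python
def member_size_mb(disk_sizes_mb, slack_mb=100):
    """Size to allocate on every RAID member, in MB.

    Takes the smallest disk in the set and leaves `slack_mb` unused, so
    a replacement that is slightly smaller than the originals still fits.
    """
    if not disk_sizes_mb:
        raise ValueError("need at least one disk")
    size = min(disk_sizes_mb) - slack_mb
    if size <= 0:
        raise ValueError("slack is larger than the smallest disk")
    return size


def replacement_fits(replacement_mb, member_mb):
    """True if a replacement disk can hold one RAID member."""
    return replacement_mb >= member_mb
```

With three disks of 476940, 476940 and 476938 MB, member_size_mb() yields
476838 MB, so a 476900 MB replacement still fits. Had the full 476938 MB
been used, that same replacement would be too small.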

If you are really lucky, then you might get a replacement disk that is
just slightly larger than the ones you originally purchased. In this
case you waste a few megabytes when you configure it to only use the
same size as the others in the RAID. The same is sometimes true if you
can buy a replacement disk with greater capacity ...
From: John Brahy
Date: Thursday, May 10, 2007 - 12:03 pm

Hello List,

We're the proud new owner of a 10x750GB appliance. We're going to put
OpenBSD on it and I was looking for suggestions or feedback on a
configuration we were considering. This server is going to be stored at our
colo and we have a point to point T1 directly connected to it. (We're going
to initially populate it here and only have to rsync daily differences after
hours.) 
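
A rough calculation of why populating the box locally makes sense, assuming
the nominal T1 line rate of 1.544 Mbit/s and ignoring protocol overhead.
The 12-hour nightly window is a hypothetical figure, not one given in the
message.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate


def transfer_seconds(size_bytes, bits_per_second=T1_BITS_PER_SECOND):
    """Seconds to move `size_bytes` at full line rate, ignoring overhead."""
    return size_bytes * 8 / bits_per_second


def bytes_per_window(hours, bits_per_second=T1_BITS_PER_SECOND):
    """Bytes that fit through the link in a window of `hours`."""
    return bits_per_second / 8 * hours * 3600


# Raw capacity of the appliance: ten 750 GB disks.
full_copy_days = transfer_seconds(10 * 750e9) / 86400
# Assumed after-hours window of 12 hours per night.
nightly_gb = bytes_per_window(12) / 1e9

print(f"full 7.5 TB copy over the T1: {full_copy_days:.0f} days")
print(f"12-hour nightly window:       {nightly_gb:.1f} GB")
```

At line rate, filling the raw capacity over the T1 would take roughly 450
days, while a 12-hour window moves about 8.3 GB. The daily differences
therefore need to stay well under that figure for the nightly rsync to
finish.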

Luca-Brozzi.ad2.com
---------------------

Partition           Size(GB)
/                   2
swap                8
/usr                4
/usr/local          4
/usr/obj            4
/usr/src            4
/var                2
/home               20
/tmp                2
/backups/server1    400
/backups/server2    400
/backups/server3    400
/backups/server4    400
/backups/server5    400
/backups/server6    400
/backups/server7    400
/backups/server8    400
/backups/server9    400
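
A quick check of how the proposed layout adds up against the ten 750 GB
disks. The sizes are copied from the table; the RAID level is not stated in
this message, so the comparison is against raw capacity only. (A later
reply mentions 3.75TB, which would be consistent with half the raw capacity
being usable, but that is an inference.)

```python
# Sizes in GB, copied from the proposed layout.
SYSTEM = {"/": 2, "swap": 8, "/usr": 4, "/usr/local": 4, "/usr/obj": 4,
          "/usr/src": 4, "/var": 2, "/home": 20, "/tmp": 2}
BACKUPS = {f"/backups/server{n}": 400 for n in range(1, 10)}

RAW_GB = 10 * 750  # ten 750 GB disks, before any RAID overhead


def total_gb(*layouts):
    """Sum the sizes of every partition in the given layouts."""
    return sum(size for layout in layouts for size in layout.values())


allocated = total_gb(SYSTEM, BACKUPS)
print(f"system: {total_gb(SYSTEM)} GB, backups: {total_gb(BACKUPS)} GB")
print(f"allocated {allocated} GB of {RAW_GB} GB raw "
      f"({allocated / RAW_GB:.0%})")
```

The layout allocates 50 GB to the system and 3600 GB to backups, 3650 GB in
total, which is a little under half of the 7500 GB raw capacity.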


Is this the best way to do it? Does anyone have suggestions on a better way
to do it?

Thanks,

John

From: Matt Bettinger
Date: Thursday, May 10, 2007 - 12:21 pm

Hi,

I believe in using the right tool for the job and, to be honest, I
wouldn't use OpenBSD for a large data store like that.  If it were me
I'd get a real SAN or NAS, but you have what you have, so my top choice
would be an OS that you can run a volume manager on: Linux with LVM2
or Veritas VM.  FreeBSD has some volume management capabilities but I
have no experience using them.  Sorry if my answer offends you.

Matt

From: Timo Schoeler
Date: Thursday, May 10, 2007 - 12:40 pm

On Thu, 10 May 2007 14:21:23 -0500

I second that, except for GNU/Linux and FreeBSD; I'd really recommend
to run, if possible, Solaris and take advantage of ZFS with all its
nice tools and features.

Btw, can you specify what this appliance is? I have an EMC Celerra at
work which /could/ be used as a highly redundant and nicely performing
CIFS server (authentication to be done by another machine, though). We
found this out after spending weeks figuring out how to add a second/third
machine to our *cough* RHEL *cough* server infrastructure to get a
redundant setup (the file server is connected to another EMC, a 3TByte
CX300, using FC) using 'a' cluster filesystem. This turned out to be a
real PITA -- and then someone told us that the Celerra can do this
most conveniently. Guess what it is doing right now? It exports a
3TByte NFSv3 FS. geeeeeeees...

To make a long story short: Really THINK VERY HARD about this setup. Once
you have decided which way to go and have stored 3TByte of data there
(regardless of *how* you do it, using GNU/Linux, FreeBSD, Solaris or
DR-DOS ;) be sure it will be a real PITA to get this corrected IF you
have to...

timo

From: John Brahy
Date: Thursday, May 10, 2007 - 2:58 pm

>  I'd really recommend to run, if possible, Solaris and take advantage of

That's a great idea, I always think OpenBSD for everything but I don't want
to know how long it would take to fsck 3.75TB. 

I'm going to go with Solaris w/ZFS. 

Thanks!

From: Jimmy Mitchener
Date: Thursday, May 10, 2007 - 2:46 pm

I'm inclined to agree here, at least until OpenBSD gets stable FFS2 support
(allowing filesystems larger than 1TB). Until then, I'd really recommend
going the GNU/Linux or FreeBSD route, although I'd probably favor GNU/Linux
with LVM for a large data store.

Jimmy.

From: Joachim Schipper
Date: Thursday, May 10, 2007 - 2:58 pm

It really depends. The volume manager crowd have a point in that a
volume manager can make it easier to do this sort of thing (supporting
really large filesystems would work as well, but that's still being
worked on).

However, quite a few backup systems will happily stripe the backups
across as many disks as you feed them; AMANDA can certainly do this,
although it's not really a good fit for filesystem-based backups. I'd be
wary of the 'one disk per server' method you use above, though; that's
not likely to be a good map in the future. You might even want to
consider mounting ~ 2TB ccds under, say, /disks and symlinking
/backups/server1, ... to those, mostly for psychological reasons.
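
A minimal sketch of the symlink layout suggested above, assuming the large
volumes are already mounted under /disks. The volume names (such as "ccd0")
and the round-robin placement are hypothetical choices for illustration.

```python
import os


def link_backups(servers, volumes, disks_root="/disks",
                 backups_root="/backups"):
    """Create backups_root/<server> symlinks into disks_root/<volume>/.

    Servers are dealt round-robin across the volumes. Returns the mapping
    {link_path: target_path} that was created. Raises FileExistsError if
    a link path already exists.
    """
    os.makedirs(backups_root, exist_ok=True)
    created = {}
    for i, server in enumerate(servers):
        volume = volumes[i % len(volumes)]
        target = os.path.join(disks_root, volume, server)
        os.makedirs(target, exist_ok=True)
        link = os.path.join(backups_root, server)
        os.symlink(target, link)
        created[link] = target
    return created
```

Calling link_backups(["server1", "server2", "server3"], ["ccd0", "ccd1"])
would put server1 and server3 on ccd0 and server2 on ccd1, while the backup
jobs keep writing to /backups/serverN. Moving a server to another volume
later only means moving its directory and re-pointing one symlink.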

You might want to consider various variants on RAID, too. This depends
on the uptime requirements, obviously, but if this is the only place
you'll store backups, you'll want to make sure a simple disk failure
doesn't cause too much trouble.

Otherwise, your non-backup directories are ridiculously large, but
that's not really going to hurt you in this case, and taking this much
storage offline for repartitioning would be painful.

		Joachim

-- 
PotD: x11/xtraceroute - graphical version of traceroute
