Hi, how do you handle having to serve terabytes of data through http/https/ftp etc.? Put it on different machines and use some kind of load balancer or intelligent program that directs requests to the right machine? Use some kind of clustering software? What hardware do you use to make your system scalable from a few terabytes of data to a few hundred of them? Does OpenBSD have any clustering software available? Is anyone running such setups? Please let me know :-) Thank you so much. Kind regards, Siju
can't say with complete confidence b/c i've never done it but using NFS or AFS would be a start. AFS would likely be the best solution, albeit with a much sharper learning curve, and it can be spread over several machines. NFS would need some system for tracking where which chunk of storage was (a PITA, AFAICT). if there is an elegant way to achieve this with NFS i would like to hear about it. cheers,
I don't really know, but how about some http proxy (hoststated comes to mind, pound or squid also works) and a lot of hosts each serving a subset of the total behind that? Yes, that's exactly what you said. I don't think NFS/AFS is that good an idea; you'll need very beefy fileservers and a fast network. Maybe rsync'ing from a central fileserver would work? However, there are a lot of specialized solutions available (various SANs come to mind; Google has published several papers on filesystems and algorithms like MapReduce, although the latter isn't going to help you for serving HTTP). All in all, though, I think the most important factors are the rate of change and the reliability requirements. A big web host might reach an impressive amount of data, but it doesn't change all that often and a site occasionally going offline is usually tolerated (just restore a recent backup). In such cases, something like the above seems to work. Joachim -- TFMotD: moduli (5) - system moduli file
NFS may actually be useful; if you really need the files in one directory space for management/updates that's a way to do it (i.e. mount all the various storage servers by NFS on a management station/ftp server/whatever). For serving content some HTTP-based scheme to get the requests to hit the right server is probably in order. Proxies are useful if you have special requirements (for example SSL, where it doesn't make sense to have the CPU and the disk in the same place), but it normally makes more sense to distribute the requests to the correct server/s in the first place (either by front-ends that know the location of content sending a Location: header if you want to give out URLs with a single server name) or by the html pointing clients to the files on the TFMotD: fsck(8) (-: Relying on black-box vendors for fixes is an additional bonus. Works for some people, though. Allegedly.
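The front-end-with-a-Location:-header idea above can be sketched roughly as follows. This is only a minimal illustration: the prefix table, the host names (store0/store1/store2.example.org) and the port are made up for the example, and a real front end would look the location up in whatever index of content you maintain.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical table: URL path prefix -> storage server holding that content.
CONTENT_MAP = {
    "/iso/": "http://store1.example.org",
    "/video/": "http://store2.example.org",
}
DEFAULT_SERVER = "http://store0.example.org"  # hypothetical fallback host


def locate(path):
    """Return the full URL on the server assumed to hold `path`."""
    for prefix, server in CONTENT_MAP.items():
        if path.startswith(prefix):
            return server + path
    return DEFAULT_SERVER + path


class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reply with a 302 and a Location: header; the client then fetches
        # the file directly from the storage server, not through this host.
        self.send_response(302)
        self.send_header("Location", locate(self.path))
        self.end_headers()

    do_HEAD = do_GET  # a redirect has no body, so HEAD is identical


def serve(port=8080):
    """Run the redirector until interrupted."""
    HTTPServer(("", port), Redirector).serve_forever()
```

Calling `serve()` starts the redirector; clients only ever see one server name in the URLs you hand out, while the bulk data traffic never passes through the front end.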
Good idea, yes, but if I recall correctly, unless major changes have been made, doesn't NFS become a huge bottleneck compared to a local drive? I think the archive is full of complaints about NFS throughput not being very good. Am I wrong here? I would love to use NFS for multiple servers accessing one source, but so far it has never worked well for that. If that's wrong, please correct me, as I would love to know whether that is still the case. Best, Daniel
I meant using it the other way round: have the *webservers* export their filesystem, and ftp/management servers mount them to provide a single space for carrying out updates and backups, locating files, etc. Having a bunch of webservers serve data from a large NFS store seems less attractive for most of the cases I can think of. The main one I see where it may be attractive is where heavy CGI processing or similar is done (that's usually a different situation to having many TB of data, though). In the CGI case, there are some benefits to distributing files by another way (notably avoiding the NFS server as a point of failure), rsync as Joachim mentioned is one way to shift the files around, CVS is also suitable, it encourages keeping tighter control over changes too, and isn't difficult to learn.
This isn't an OpenBSD-specific solution, but you should be able to use an EMC SAN to accomplish this (we use a Fibre Channel setup).
Bullshit. just use NFS :)
-Bob
--
#!/usr/bin/perl
if ((not 0 && not 1) != (! 0 && ! 1)) {
print "Larry and Tom must smoke some really primo stuff...\n";
}
Something like that might be a very good idea, yes. Just don't try to serve everything directly off NFS. (An even better idea might be setting up a repository for your favourite version control system and making partial checkouts. Gets you most of the benefit of a unified filesystem, at the cost of complex - and thus fragile - checkin hooks. On the other hand, version control is likely to I think doing that in HTML will quickly become an administration Yeah, they seem to work. It wouldn't be my first choice, either, but I've never tried to run OpenBSD in this kind of environment. At least a good, expen$$$ive SAN is good for covering your backside. JOachim -- TFMotD: perl561delta (1) - what's new for perl v5.6.x
there is nothing wrong with serving directly from NFS. -- Henning Brauer, hb@bsws.de, henning@openbsd.org BS Web Services, http://bsws.de Full-Service ISP - Secure Hosting, Mail and DNS Services Dedicated Servers, Rootservers, Application Hosting - Hamburg & Amsterdam
Really? You have a lot more experience in this area, so I will defer to you if you are sure, but it seems to me that in the sort of system I explicitly assumed (something like a web farm), serving everything off NFS would involve either very expensive hardware or be rather slow. I see how in your example - a lot of storage, not accessed often - just serving everything off NFS makes perfect sense. However, that was not what I was talking about. Perhaps you could elaborate a little? I'm interested, at least... Joachim -- TFMotD: hostapd.conf (5) - configuration file for the Host Access Point daemon
at HPC facilities (LANL, sandia, LLNL, argonne, etc) NFS is used extensively for this purpose since the amount of storage required for simulation outputs greatly outstrips the storage that any one machine can provide, especially the compute nodes. before i switched my email address i would get regular notifications that NFS filesystems were down for this-or-that many hours at compute facility X. from my observations redundancy is the biggest problem with NFS and that its ability to efficiently serve up data is more than ample. AFS provides additional redundancy via volume replication and having the various services that comprise it spread over several machines. there is a lot of documentation to go through tho. cheers,
Redundancy is certainly a problem, but lots of US HPC and distributed computing sites have severe scaling problems with NFS. High r/w traffic has killed several file servers in projects that we work with, and it sucks big time. I don't know anyone who's happy or excited or confident in their HPC NFS deployments; everyone I've talked to hopes for a real solution to this problem. ;) If the OP's use case involves lots of writes (especially from many clients), I'd be concerned about NFS' ability to keep up. Then again, I've had problems with pretty much all of the network filesystems (including AFS, though it's the least bad in my experience). I'm still waiting for Ceph[0] to mature (and to shed its linuxisms). ;) [0] http://ceph.sf.net/ -- o--------------------------{ Will Maier }--------------------------o | web:.......http://www.lfod.us/ | email.........willmaier@ml1.net | *------------------[ BSD Unix: Live Free or Die ]------------------*
no. cache works. reads are no problem whatsoever in this kind of setup (well. I am sure you can make that a problem with many frontend servers and lots to read. obviously. but for any sane number of frontends, should not) -- Henning Brauer, hb@bsws.de, henning@openbsd.org BS Web Services, http://bsws.de Full-Service ISP - Secure Hosting, Mail and DNS Services Dedicated Servers, Rootservers, Application Hosting - Hamburg & Amsterdam
Yeah, you are right. Now what was I thinking, anyway? Anyway, thanks! Joachim -- TFMotD: pci_make_tag, pci_decompose_tag, pci_conf_read, pci_conf_write (9) - PCI config space manipulation functions
OK, then how well does CARP work with NFS for a backup mount in case something goes wrong with the main NFS server? Is it efficient, is it possible, and will the client mount again by itself? What delay is involved? What do you consider a sane number of front ends: 10, fewer, more? By cache, do you mean cache on the NFS server or cache on the NFS client? Sorry, it looks like I have more questions than answers, as I gave up on NFS a few years ago because of the bottleneck on NFS transfers. Writes were bad, reads OK but not great. It may well be different now; I would be happy with decent read performance, but what can be expected? The archive is not too nice on the subject, I have to say. It always looks like there is a bottleneck on the NFS side. For a small site or low traffic, yes, that's great, but where can one expect to reach the limits? Any ideas? Maybe it's time for me to revisit this yet again, but I have never been very successful with it under high traffic. Many thanks, Daniel
Well, I think that depends on too many variables. I have a movie server (OBSD) that exports NFS to two home theatre computers (FBSD). The movie server is a dual P3 1GHz with 4 U320 SCSI disks in RAID0. When simultaneously playing different DVDs on the two theatre computers, the movie server is >90% idle; that's with TCP connection. When using UDP mounts it's >96% idle. Although movie files are large sequential data, I dump DVDs to VOB format using mplayer, so the files are 4-8GBs. This eliminates caching in my situation. So if you had many front-ends accessing similar files and caching was taking place, you'd experience greater efficiency. I believe NFS does do some caching server-side, like directory lookups, etc. Also, when I rip a DVD, it goes straight to the NFS mount. The bottleneck here is my DVD players, which can only read at ~2MB/s. Again, I disagree. Too many people try running cheap IDE disks in server environments and then wonder why they have poor performance. They blame the software. Get SCSI; it is made for highly random access, which is Who knows, just try an experiment. From my experience, the bottlenecks seem to be the local file system (UFS & disk system) of the exporting machine if many clients. Otherwise, it is network bandwidth. NFS seems really light on top of UFS, especially when using UDP. BTW, UDP mounts are very robust when the clients and server are on the same Ethernet All I can say is that I love NFS. You're missing out. Plus it is so simple. I have wanted to check out AFS for fail-over reasons, but too many docs for me to read. One last note. Holland's disk structuring is very cool (read his earlier post for details). If I were to serve NFS to dozens or hundreds of clients I would use his scheme, however, apply his partitioning scheme at the host level. If an NFS server is saturated, spread the load by adding another server. The drawback is that each client has multiple NFS mounts. However, if you have this ...
I don't have the experience that others here have, but a small ISP that I worked for used NFS to serve http. It was a Linux shop; they had a netapp exporting 5000 users' /home dirs over NFS to a dozen cheap 1U i386 whiteboxes that ran apache/tomcat/cgi etc. Disk and CPU (for cgi, https, tomcat, php, etc.) were separated. The only problem that they had with NFS was flock when mbox was used for mail storage for the mail farm (same netapp). When courier maildir was used, this was no longer an issue. The web farm was mainly read-only, while the mail farm was split between reads and writes, to the same netapp. All eggs were in the one netapp basket....... Now I work for Sun, and they have something like 30,000 employees. Nearly all staff use Sunray workstations, and home directories are NFS mounts over a global WAN. There is not one massive /home box, obviously. There are many home NFS servers, in each of many cities. From here in Scotland, I can work with an engineer elsewhere by cd'ing to /somwhere/holland, /nowwhere/japan, /elsewhere/colorado. It only takes a couple of seconds for the automounter to kick in. The output of "mount" shows the layout of /home as something like: /home/user1 box1.uk:/export/home5/28/user1 /home/user2 box9.au:/export/home17/2/user2 So, many average-sized boxes are used, which in turn have many average disk packs, which are split. As is expected, LDAP and NIS are used. -- Craig Skinner | http://www.kepax.co.uk | aye-right@kepax.co.uk
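To make the /home layout above a bit more concrete, here is a small sketch of the lookup an automounter-style map performs: one entry per user, pointing at whichever NFS server holds that home directory. The two entries are the ones quoted in the post; the dictionary and the `resolve_home` helper are purely illustrative and are not how the automounter itself is implemented.

```python
# Illustrative map in the spirit of an automounter map; the two entries
# are taken from the "mount" output quoted in the post.
HOME_MAP = {
    "user1": "box1.uk:/export/home5/28/user1",
    "user2": "box9.au:/export/home17/2/user2",
}


def resolve_home(path):
    """Split '/home/<user>/...' into (nfs_server, path_on_that_server)."""
    parts = path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "home":
        raise ValueError("not under /home: " + path)
    user, rest = parts[1], parts[2:]
    # Each map entry has the form 'server:/exported/path'.
    server, export = HOME_MAP[user].split(":", 1)
    return server, "/".join([export] + rest)
```

The point of the design is that users and applications only ever see /home/<user>; moving a home directory to another box means changing one map entry, not every client.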
Too open-ended a question... Are you talking about many TB on one site? Lots of sites? Is there some reason it has to be on one server or one site? Is this "huge storage, huge demand"? Huge storage, low demand? Is this storage all needed on day 1, or will it grow with time? (hint: if it grows with time, build for NOW, with ability to add later, don't buy storage in advance!) etc. Let the answers to those questions guide your engineering work, don't rely on knee-jerk reactions. And don't be afraid to change the question to meet available answers. :) Common error is to take the given proposed solution (posed as a problem, but often someone has digested the REAL problem into what they think is the only possible model, and sent you down a bad alley) as gospel, and never question the basic assumptions. I've got a web server with over 3.5TB of storage on it that cost about $6000US a year or so ago. It's a huge-storage, low-demand app, probably gets on average a query a day, if that. If the box breaks, time can be spent repairing it, but we don't want to lose the data (it's carefully backed up, but the backup media is so compressed, it takes longer to uncompress the files than it does to scp them back into the box!). So, the thing has redundancy where it counts (disk) and simplicity where it doesn't matter, and it can be upgraded, enhanced and changed as needed. And, we have a small enough amount invested in the thing that we can completely change our mind about the approach to the problem any time in the future and throw it all away with a very clear conscience. (My current boss-of-the-week thinks he wants to replace this with an unknown proprietary app feeding a $30,000 per-processor database server attached to a $60,000 disk array, so you can see how insignificant the price tag on this system is. You can also see something about my boss. And why I'm looking for a better job). Let's say you have one website that you are trying to serve massive amounts of ...
What are the reasonings behind this? Thanks for the awesome post! regards, ~Jason
I think it runs something like this: if there is a problem somewhere on the disk and it's all one big partition, you must fix the big partition; if it's lots of small partitions, you fix only the one with the problem. Even worse, in some situations, the difference is between being dead and being somewhat crippled. Methinks there's lots of hard-won experience behind Nick's answers ;)
Your last assumption is the most correct, and Nick has put some of that experience into FAQ-14 for our reading pleasure. In general, you always want to assume a failure *WILL* occur, rather than think in terms of "if" something will fail. Having lots of small partitions, and using Read Only partitions wherever possible (also mentioned by Nick), gives you a number of important advantages. Assume that someone, possibly you, has managed to trip over the power cord: how long will it take you to get the server back up? If your partitions are Read/Write, then you will be doing a fsck on each of them. That means time. If your partitions are huge, then you will need a lot of RAM and time to perform the fsck. If you have a massive partition and insufficient RAM, then your fsck will fail (see FAQ-14.7 "fsck(8) time and memory requirements") and you'll be stuck like a turtle on its back at a soup competition. The above is just your start-up time after a crash or power loss. Assume that someone, possibly you, has written some bad code that will scribble all over the data in one of your partitions. How long will it take you to recover? If the partition was marked RO, then you don't have a problem. If it was a small RW partition, you can repair it reasonably quickly from backup. If your backup media fails, your losses are minimal. By comparison, if it's a huge RW partition, then you're stuffed. The list of reasons goes on and on, but when you really think about it, you'll understand that you're just doing proper "risk management" by trying to mitigate as many of the bad effects of failures as possible. Never drink the marketing kool-aid that will try to sell you on the idea that failures are somehow avoidable. Sure, it might sound like a nice idea, but the idea always falls short of reality. Being prepared for the reality of failures is a much better approach than sticking your head in the sand. /jcr
yeah, though fortunately most of it was in the form of confirmation of In addition to Tony and J.C.'s comments (I've edited them out for size, go back and read 'em if you haven't), let me add another really big reason: Growth and scalability. Usual logic goes something like this: "I need a lot of space, so I'm going to build a file system that has a lot of space in it", and you drop all that space into one file system. Efficient? For a while, yes. BUT, what about when it fills up? Usual response: "use a Volume Manager" or "Dump the data to a new, bigger disk system". Ok, the ability of some "volume managers" to dynamically increase the size of a file system is kinda cool, but I would argue that for many apps, it is just another way of saying, "The initial design SUCKED and I had more money than brains to fix the problem" (assuming one of the commercial products, of course). Somewhat over simplification, of course...but... Dumping the data from one disk to another is fine and dandy when you are talking about your 40G disk on your home or desktop computer, the fact that you are down for a few hours is no big deal. But what about a server? I don't care how fast your disks are, moving 300G of data to a new disk system is a lot of slow work. Here's a better idea: break your data into more manageable chunks, and design the system to fill those chunks AND make it easy to add more later. So, you implement today with 1TB of data space, broken up into two 500G chunks. Fill the first one, move on to the second one. Fill the second one, you bolt on more storage -- a process which will probably take minutes, not hours. When you bolt on more storage, you will be doing it in the future, when capacity is bigger and cost is less. Let's look at the machine I mentioned yesterday, our e-mail archive system: disks: Filesystem 1K-blocks Used Avail Capacity Mounted on /dev/wd0a 199358 49742 139650 26% / /dev/wd0e 1030550 6 979018 ...
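The "fill one chunk, then move on to the next" approach described above could look roughly like this on the application side. It is a sketch under assumptions: the mount point names are hypothetical, and the `free_bytes` hook exists only so the logic can be tried without real filesystems.

```python
import shutil


def pick_chunk(mount_points, needed_bytes, free_bytes=None):
    """Return the first chunk, in fill order, with room for needed_bytes.

    mount_points: chunk mount points in the order they should be filled,
                  e.g. ["/archive/a", "/archive/b"] (hypothetical names).
    free_bytes:   optional callable(mount_point) -> free bytes; defaults
                  to asking the operating system via shutil.disk_usage.
    """
    if free_bytes is None:
        free_bytes = lambda mount: shutil.disk_usage(mount).free
    for mount in mount_points:
        if free_bytes(mount) >= needed_bytes:
            return mount
    # Every chunk is full: this is the moment to bolt on more storage
    # and append its mount point to the list.
    raise RuntimeError("all chunks are full")
```

Adding capacity later then means mounting a new chunk and appending it to the list; nothing already stored has to be moved.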
As Nick tried to point out, buying all of the storage you'll eventually need for a project "right now" generally proves to be a waste of money over the long haul. Adding storage incrementally is normally cheaper. There is one, single, very sad exception to the above rule: Poorly Configured Hardware RAID sets. I've seen this problem a lot. Somebody decides they want to put "X" disks in a RAID-5 array, so they buy a hardware controller and the "X" disks, then configure the whole thing according to the controller directions... -Dumb Move. Since you followed along with your controller documentation to set up your RAID-5 across all the disks, you probably didn't notice that you used all of the space on each of the drives -- or you didn't think about the ramifications of using all of the space. Guess what? -If one of those disks fails, you may be stuffed. The reason is simply that you may not be able to get another disk of the exact same size -- EVEN IF IT HAS AN IDENTICAL PART NUMBER. Most people do not realize there are subtle differences between disks, yes, even between disks with the exact same part number. When manufactured, one batch of drives might allow for slightly more or less space than the next batch of drives manufactured. As long as the drives will hold at least "N" number of bits and they satisfy the marketing or packaging claims (i.e. "500 Gigabytes" or whatever), they get sold. So all of your original disks looked identical and allowed you to use the same exact size for the total space on each. And since you followed the directions, you used all of it. The trouble is finding a correct replacement when one of the originals fails. If you are really lucky, then you might get a replacement disk that is just slightly larger than the ones you originally purchased. In this case you waste a few megabytes when you configure it to only use the same size as the others in the RAID. The same is sometimes true if you can buy a replacement disk with greater capacity ...
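The size-mismatch problem described above comes down to a simple comparison, sketched below. Holding back a reserve on each member is one common precaution and is an assumption here, not necessarily what the rest of the (truncated) post recommends; the function names and the numbers in the usage note are made up for illustration.

```python
def member_size(disk_sizes, reserve):
    """Space to use on each RAID member: the smallest disk minus a reserve.

    disk_sizes and reserve are in the same unit (e.g. sectors or bytes).
    Leaving `reserve` unused on every disk gives a slightly smaller
    replacement a chance of still fitting.
    """
    return min(disk_sizes) - reserve


def replacement_fits(replacement_size, configured_member_size):
    """True if a replacement disk can hold a member of the configured size."""
    return replacement_size >= configured_member_size
```

For example, with hypothetical disks of 1000, 1000 and 998 units and a reserve of 10, each member uses 988 units, so a replacement of 995 units still fits, whereas it would not if all 998 units had been used.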
Hello List,

We're the proud new owner of a 10x750GB appliance. We're going to put OpenBSD on it and I was looking for suggestions or feedback on a configuration we were considering. This server is going to be stored at our colo and we have a point to point T1 directly connected to it. (We're going to initially populate it here and only have to rsync daily differences after hours.)

Luca-Brozzi.ad2.com
---------------------
Partition           Size(GB)
/                   2
swap                8
/usr                4
/usr/local          4
/usr/obj            4
/usr/src            4
/var                2
/home               20
/tmp                2
/backups/server1    400
/backups/server2    400
/backups/server3    400
/backups/server4    400
/backups/server5    400
/backups/server6    400
/backups/server7    400
/backups/server8    400
/backups/server9    400

Is this the best way to do it? Does anyone have suggestions on a better way to do it?

Thanks,
John
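As a quick sanity check on the proposed layout, the partition sizes can be added up and compared with the raw capacity of the appliance. The numbers are taken from the table in the post; the post does not say what RAID level (if any) is used, so the comparison below is against raw capacity only and ignores RAID and filesystem overhead.

```python
# Partition sizes in GB, as listed in the post.
SYSTEM_GB = {
    "/": 2, "swap": 8, "/usr": 4, "/usr/local": 4, "/usr/obj": 4,
    "/usr/src": 4, "/var": 2, "/home": 20, "/tmp": 2,
}
BACKUP_GB = {f"/backups/server{n}": 400 for n in range(1, 10)}


def total_gb(*layouts):
    """Sum the sizes of all partitions in the given layouts."""
    return sum(sum(layout.values()) for layout in layouts)


RAW_GB = 10 * 750  # ten 750GB disks, before any RAID overhead
allocated = total_gb(SYSTEM_GB, BACKUP_GB)

# 50 GB of system partitions plus 9 x 400 GB of backup partitions.
print(f"allocated: {allocated} GB, raw capacity left: {RAW_GB - allocated} GB")
```

The layout allocates 3650 GB, which leaves 3850 GB of the 7500 GB raw capacity for redundancy or later growth.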
Hi, I believe in using the right tool for the job and, to be honest, I wouldn't use OpenBSD for a large data store like that. If it were me I'd get a real SAN or NAS, but you have what you have, so my top choice would be an OS that you can run a volume manager on: Linux with LVM2 or Veritas VM. FreeBSD has some volume management capabilities but I have no experience using them. Sorry if my answer offends you. Matt
On Thu, 10 May 2007 14:21:23 -0500 I second that, except for GNU/Linux and FreeBSD; I'd really recommend to run, if possible, Solaris and take advantage of ZFS with all its nice tools and features. Btw, can you specify what this appliance is? I have an EMC Celerra at work which /could/ be used as a highly redundant and nicely performing CIFS server (authentication to be done by another machine, though). We found this out after weeks of figuring out how to add a second/third machine to our *cough* RHEL *cough* server infrastructure to get a redundant setup (the file server is connected to another EMC, a 3TByte CX300, using FC) using 'a' cluster filesystem. This turned out to be a real PITA -- and then someone told us that the Celerra can do this most conveniently. Guess what it is doing right now? It exports a 3TByte NFSv3 FS. geeeeeeees... To make a long story short: Really THINK VERY HARD about this setup. Once you have decided which way to go and stored 3TByte of data there (regardless of *how* you do it, using GNU/Linux, FreeBSD, Solaris or DR-DOS ;) be sure it will be a real PITA to get this corrected IF you have to... timo
> I'd really recommend to run, if possible, Solaris and take advantage of That's a great idea, I always think OpenBSD for everything but I don't want to know how long it would take to fsck 3.75TB. I'm going to go with Solaris w/ZFS. Thanks!
I'm inclined to agree here, at least until OpenBSD gets stable ffs2 support (allowing filesystems larger than 1TB); until then, I'd really recommend going the GNU/Linux or FreeBSD route, although I'd probably favor GNU/Linux with LVM for a large data store. Jimmy.
It really depends. The volume manager crowd have a point in that a volume manager can make it easier to do this sort of thing (supporting really large filesystems would work as well, but that's still being worked on). However, quite a few backup systems will happily stripe the backups across as many disks as you feed them; AMANDA can certainly do this, although it's not really a good fit for filesystem-based backups. I'd be wary of the 'one disk per server' method you use above, though; that's not likely to be a good map in the future. You might even want to consider mounting ~ 2TB ccds under, say, /disks and symlinking /backups/server1, ... to those, mostly for psychological reasons. You might want to consider various variants on RAID, too. This depends on the uptime requirements, obviously, but if this is the only place you'll store backups, you'll want to make sure a simple disk failure doesn't cause too much trouble. Otherwise, your non-backup directories are ridiculously large, but that's not really going to hurt you in this case, and taking this much storage offline for repartitioning would be painful. Joachim -- PotD: x11/xtraceroute - graphical version of traceroute
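The symlink arrangement suggested above (large volumes mounted under /disks, with /backups/serverN pointing into them) might be set up along these lines. This is only a sketch: the volume names (ccd0, ccd1) and the server-to-volume assignment are hypothetical, and it assumes the volumes are already created and mounted.

```python
import os


def link_backups(assignments, disks_root="/disks", backups_root="/backups"):
    """Create backups_root/<server> symlinks pointing into the big volumes.

    assignments: dict mapping server name -> volume directory name under
                 disks_root, e.g. {"server1": "ccd0"} (hypothetical names).
    """
    os.makedirs(backups_root, exist_ok=True)
    for server, volume in assignments.items():
        target = os.path.join(disks_root, volume, server)
        link = os.path.join(backups_root, server)
        os.makedirs(target, exist_ok=True)
        # Leave any existing entry (link or real directory) alone.
        if not os.path.lexists(link):
            os.symlink(target, link)
```

Moving a server's backups to another volume later only means copying the data and re-pointing one symlink; the /backups/serverN paths used by the backup jobs stay the same.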
