11

I have a basic cloud running on Ubuntu Server (9.04) and Eucalyptus. Walrus (Eucalyptus' API-compatible S3 implementation) stores files on the cloud controller. However, each of the other 4 servers has 1TB of storage which is largely unused. I am looking for a way to pool all the storage together in order to make use of all available resources. I have been looking at various options including PVFS, Lustre, and HDFS (Hadoop).

My only requirements are that it needs to be scalable and that it runs well on Ubuntu. I would appreciate hearing from anyone who has experience with such technologies, and I look forward to your suggestions.

Jaunty
  • 151

11 Answers

5

While I haven't personally implemented it anywhere in our systems, I have looked pretty extensively at Gluster. I know a few people at some large sites that use this and it apparently works really well. They use it in production for some heavy duty HPC applications.

Kamil Kisiel
  • 12,444
2

GlusterFS would seem like the ideal solution to me. To the guy who claims that Gluster takes lots of effort to set up, I've got to say that he's probably never tried. As of Gluster 3.2 the configuration utilities are pretty awesome, and it takes 2 or 3 commands to get a Gluster volume up and shared on the network. Mounting Gluster volumes is equally simple.

On the plus side, it also gives you a lot more flexibility than NFS. It does striping, replication, and geo-replication, and it is of course POSIX compliant. There is an extension called HekaFS which adds SSL and more advanced authentication mechanisms, which is probably interesting for cloud computing. Also, it scales! It is F/OSS and is being developed by Red Hat, who recently purchased Gluster.
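For reference, this is roughly what "2 or 3 commands" looks like with the Gluster CLI. A sketch, assuming two servers with a brick directory at /data/brick1 (host and path names are placeholders):

```shell
# From server1, add the other server to the trusted storage pool
gluster peer probe server2

# Create a 2-way replicated volume from one brick on each server
gluster volume create myvol replica 2 server1:/data/brick1 server2:/data/brick1

# Start the volume, then mount it anywhere with the native FUSE client
gluster volume start myvol
mount -t glusterfs server1:/myvol /mnt/gluster
```

After that, anything written to /mnt/gluster is replicated across both servers.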

juwi
  • 591
1

Putting some sort of shared filesystem behind a virtualization environment is pretty common. You have lots of choices, depending on what you're looking to accomplish.

The simplest solution is probably NFS, because this is going to be supported natively by whatever distribution you're running. NFS can perform reasonably well as a virtualization backend filesystem, although it's not going to be the fastest thing out there.
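To illustrate how little setup NFS needs on Ubuntu, here's a minimal sketch; the export path, subnet, and mount point are examples:

```shell
# On the storage server: export a directory to the cluster subnet
# (append to /etc/exports, then tell the NFS server to re-read it)
echo '/srv/images 192.168.1.0/24(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra

# On each client node: mount the export
mount -t nfs storage1:/srv/images /var/lib/images
```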

If you're running a RedHat (or derivative) cluster, you'll have good out-of-the-box support for GFS2, RedHat's cluster filesystem. This doesn't scale up to hundreds of nodes, but it's fine for smaller clusters.

Beyond that, you're starting to enter the range of things like Lustre, Glusterfs, GPFS, and so forth. These are all high-performance parallel filesystems, but they require substantially more work to set up than the other options here. If you have a large environment they may be worth looking at.

larsks
  • 47,453
1

I'd agree with @larsks that NFS is the best option: set up some iSCSI targets, NFS, done. This will scale to about 5-10 nodes; YMMV based on I/O, network capability, etc. (Alternatively, set up iSCSI with multipath I/O support.)

If you need something for 20+ nodes, you may want to investigate Ceph. Lustre is promising and stable, but it is an (F/OSS) Oracle product and I have a personal dislike of Oracle. :)

Ceph is also quite active; the most recent release was 5 days ago.

brent saner
  • 440
  • 2
  • 7
1

Have you ever looked at MogileFS? http://danga.com/mogilefs/

It's not a file system in the traditional sense, but it is good for distributing file data across a cluster (with replication and redundancy taken into account).

If you're serving up files for a web application, you will need something to serve the files. I would suggest a PHP script that uses the HTTP request as the search key for finding the file you want in MogileFS. You can then read the contents of the file into a buffer and echo/print it out.

MogileFS is already pretty quick, but you can combine mogileFS with memcache to speed up access to the most commonly used files.
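The memcache idea is just a read-through cache in front of MogileFS: check the cache first, and only hit the backing store on a miss. A minimal Python sketch, where `fetch_from_mogile` is a placeholder for a real MogileFS client lookup and a plain dict stands in for memcached:

```python
cache = {}  # stands in for memcached


def fetch_from_mogile(key):
    # Placeholder for a real MogileFS client call; here we just
    # fabricate some bytes so the sketch is self-contained.
    return b"contents of " + key.encode()


def get_file(key):
    data = cache.get(key)
    if data is None:              # cache miss: go to the backing store
        data = fetch_from_mogile(key)
        cache[key] = data         # populate the cache for next time
    return data


print(get_file("logo.png"))      # first call: fetched from the store
print(get_file("logo.png"))      # second call: served from the cache
```

With real memcached you would also set an expiry time so the cache doesn't serve stale data forever after a file is replaced.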

1

With Lustre you have to run a special kernel on the servers, and I would have those servers act only as servers and nothing else.

Strangely, the most sane answer may well be NFS. We have used NFS on Amazon's cloud. It may not scale as well as some file systems, but its simplicity should not be overlooked. A single namespace is probably not worth the effort it would take to implement.

James
  • 2,252
1

XtreemFS could be a solution for you. It is fairly simple to install and configure, there are also packages for Ubuntu.

blong
  • 144
  • 1
  • 1
  • 13
1

Are you still looking into HDFS? One of the Cloudera guys gave a talk at VelocityConf this year about Hadoop and HDFS focused on managing big data clusters, so he talked about HDFS quite a bit. The slides are pretty informative. I haven't worked with HDFS personally, but I talked with some random folks at Velocity that are using it on Ubuntu to do various data analysis.

jtimberman
  • 7,665
1

MooseFS (a distributed file system) fits your requirements. It is scalable and works well on Ubuntu. It might also be useful for you to see how to install/update MooseFS from the officially supported repository on Ubuntu.

TechGeek
  • 161
0

Not sure what you're doing, but this sounds like a potentially interesting application for CouchDB.

duffbeer703
  • 22,305
0

You could try PVFS2. It's much easier to set up than Lustre, and generally faster than Gluster.

wazoox
  • 7,156