11

I have a basic cloud running on Ubuntu Server (9.04) and Eucalyptus. Walrus (Eucalyptus' API-compatible S3 implementation) stores files on the cloud controller. However, each of the other 4 servers has 1TB of storage which is largely unused. I am looking for a way to pool all the storage together in order to make use of all available resources. I have been looking at various options including PVFS, Lustre, and HDFS (Hadoop).

My only requirements are that it needs to be scalable and that it runs well on Ubuntu. I would appreciate hearing from anyone who has experience with such technologies, and I look forward to your suggestions.

Jaunty
  • 151

11 Answers

5

While I haven't personally implemented it anywhere in our systems, I have looked pretty extensively at Gluster. I know a few people at some large sites that use this and it apparently works really well. They use it in production for some heavy duty HPC applications.

Kamil Kisiel
  • 12,444
2

GlusterFS would seem like the ideal solution to me. To the guy who claims that Gluster takes lots of effort to set up, I've got to say that he's probably never tried. As of Gluster 3.2 the configuration utilities are pretty awesome, and it takes 2 or 3 commands to get a Gluster volume up and shared on the network. Mounting Gluster volumes is equally simple.

On the plus side, it also gives you a lot more flexibility than NFS. It does striping, replication, and geo-replication, and it is of course POSIX compliant. There is an extension called HekaFS which adds SSL and more advanced authentication mechanisms, which is probably interesting for cloud computing. Also, it scales! It is F/OSS and is being developed by Red Hat, who recently purchased Gluster.
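For reference, this is roughly what "2 or 3 commands" looks like with the Gluster CLI. A sketch, assuming two servers with a brick directory at /data/brick1 (host and path names are placeholders):

```shell
# From server1, add the other server to the trusted storage pool
gluster peer probe server2

# Create a 2-way replicated volume from one brick on each server
gluster volume create myvol replica 2 server1:/data/brick1 server2:/data/brick1

# Start the volume, then mount it anywhere with the native FUSE client
gluster volume start myvol
mount -t glusterfs server1:/myvol /mnt/gluster
```

After that, anything written to /mnt/gluster is replicated across both servers.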

juwi
  • 591
1

Putting some sort of shared filesystem behind a virtualization environment is pretty common. You have lots of choices, depending on what you're looking to accomplish.

The simplest solution is probably NFS, because this is going to be supported natively by whatever distribution you're running. NFS can perform reasonably well as a virtualization backend filesystem, although it's not going to be the fastest thing out there.
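To illustrate how little setup NFS needs on Ubuntu, here's a minimal sketch; the export path, subnet, and mount point are examples:

```shell
# On the storage server: export a directory to the cluster subnet
# (append to /etc/exports, then tell the NFS server to re-read it)
echo '/srv/images 192.168.1.0/24(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra

# On each client node: mount the export
mount -t nfs storage1:/srv/images /var/lib/images
```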

If you're running a RedHat (or derivative) cluster, you'll have good out-of-the-box support for GFS2, RedHat's cluster filesystem. This doesn't scale up to hundreds of nodes, but it's fine for smaller clusters.

Beyond that, you're starting to enter the range of things like Lustre, Glusterfs, GPFS, and so forth. These are all high-performance parallel filesystems, but they require substantially more work to set up than the other options here. If you have a large environment they may be worth looking at.

larsks
  • 47,453
1

I'd agree with @larsks that NFS is the best option: set up some iSCSI targets, NFS, done. This will scale to about 5-10 nodes; YMMV based on I/O, network capability, etc. (Alternatively, set up iSCSI with multipath I/O support.)

If you need something for 20+ nodes, you may want to investigate Ceph. Lustre is promising and stable, but it is an (F/OSS) Oracle product and I have a personal dislike of Oracle. :)

Ceph is also quite active; the most recent release was 5 days ago.

brent saner
  • 440
  • 2
  • 7
1

Have you ever looked at MogileFS? http://danga.com/mogilefs/

It's not a file system in the traditional sense, but it is good for distributing file data across a cluster (with replication and redundancy taken into account).

If you're serving up files for a web application, you will need something to serve the files. I would suggest a PHP script that uses the HTTP request as the search key for finding the file you want in MogileFS. You can then read the contents of the file into a buffer and echo/print it out.

MogileFS is already pretty quick, but you can combine mogileFS with memcache to speed up access to the most commonly used files.
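The memcache idea is just a read-through cache in front of MogileFS: check the cache first, and only hit the backing store on a miss. A minimal Python sketch, where `fetch_from_mogile` is a placeholder for a real MogileFS client lookup and a plain dict stands in for memcached:

```python
cache = {}  # stands in for memcached


def fetch_from_mogile(key):
    # Placeholder for a real MogileFS client call; here we just
    # fabricate some bytes so the sketch is self-contained.
    return b"contents of " + key.encode()


def get_file(key):
    data = cache.get(key)
    if data is None:              # cache miss: go to the backing store
        data = fetch_from_mogile(key)
        cache[key] = data         # populate the cache for next time
    return data


print(get_file("logo.png"))      # first call: fetched from the store
print(get_file("logo.png"))      # second call: served from the cache
```

With real memcached you would also set an expiry time so the cache doesn't serve stale data forever after a file is replaced.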

1

With Lustre you have to run a special kernel on the servers, and I would have those servers act only as servers and nothing else.

Strangely, the most sane answer may well be NFS. We have used NFS on Amazon's cloud. It may not scale as well as some file systems, but its simplicity should not be overlooked. A single namespace is probably not worth the effort it would take to implement.

James
  • 2,252
1

XtreemFS could be a solution for you. It is fairly simple to install and configure, there are also packages for Ubuntu.

blong
  • 144
  • 1
  • 1
  • 13
1

Are you still looking into HDFS? One of the Cloudera guys gave a talk at VelocityConf this year about Hadoop and HDFS focused on managing big data clusters, so he talked about HDFS quite a bit. The slides are pretty informative. I haven't worked with HDFS personally, but I talked with some random folks at Velocity that are using it on Ubuntu to do various data analysis.

jtimberman
  • 7,665
1

MooseFS (a distributed file system) fits your requirements. It is scalable and works well on Ubuntu. It might also be useful for you to see how to install/update MooseFS from the officially supported repository on Ubuntu.

TechGeek
  • 161
0

Not sure what you're doing, but this sounds like a potentially interesting application for CouchDB.

duffbeer703
  • 22,305
0

You could try PVFS2. It's much easier to set up than Lustre, and generally faster than Gluster.

wazoox
  • 7,156