Hadoop-like framework with VFP

Topics: New Project Idea
Oct 27, 2011 at 5:09 AM
Edited Oct 27, 2011 at 5:18 AM

Hadoop has been making a lot of noise lately, and the more I read about it, the more I think VFP could quite easily be used to build a competitive framework with a superior programming model. This may sound crazy, but bear with me.

People started moving away from VFP several years ago, when the amount of data companies needed to manage far exceeded VFP's 2GB file size limit. This started happening well before Microsoft officially killed the product. SQL Server became the natural DBMS for VFP developers to migrate to. While this eliminated the file size limit, you lost one of the true benefits of VFP's local database engine: the ability to use embedded SQL. You had to use SQL pass-through or remote views defined in a database container.

Now we've reached a point where it's become too cumbersome to manage all your data in a single DBMS, and we hear buzzwords like "sharding" and "NoSQL". Persisting data in separate files spread out over several (or even thousands of) physical machines is the implementation strategy behind the various NoSQL solutions that have come en vogue. Although Hadoop isn't strictly a NoSQL database, its main purpose is to process smaller chunks of data distributed across thousands of physical machines using map reduce. Well, hell. If you're going to store data in smaller chunks across thousands of machines, what better way to store it than in VFP free tables?
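To illustrate what we gave up in the move: here's the kind of embedded SQL you can run directly against a local free table, next to the pass-through dance the same query needs against SQL Server (the table name and connection string are made up for the sake of the example):

```foxpro
* Embedded SQL against a hypothetical local free table -
* the query is just part of the language
SELECT cust_id, SUM(amount) AS total ;
    FROM customer ;
    GROUP BY cust_id ;
    INTO CURSOR csrTotals

* The same query against SQL Server requires SQL pass-through
lnHandle = SQLSTRINGCONNECT("Driver={SQL Server};Server=.;Database=Sales")
SQLEXEC(lnHandle, ;
    "SELECT cust_id, SUM(amount) AS total FROM customer GROUP BY cust_id", ;
    "csrTotals")
SQLDISCONNECT(lnHandle)
```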

I can think of a number of reasons why VFP would make a perfect tool for designing a map reduce framework to challenge Hadoop:

  • VFP applications have a tiny footprint and low-impact deployment requirements: copy the ~4MB vfp9r.dll alongside your application's executable to a folder and your application will run. When you need to deploy the application on thousands of machines, simple deployment is a big deal.
  • VFP's embedded SQL support allows query result sets to be output to free tables that have a file presence. This lets the map processes create intermediate result sets that the reduce process(es) can easily access and aggregate.
  • VFP is an interpreted language that can compile source code into p-code on the fly. Why is this important? Because the key programming paradigm of Hadoop is to "bring the code to the data" rather than the other way around. This allows you to simply copy program files (as jobs) to whatever machine needs to run the map process, OR you can log the source code to execute the job in a memo field of a central job table that all the map processes can access. You can do the same thing with the machine(s) handling the reduce process.
  • Rushmore. I don't need to tout VFP's optimization technology, but I will say that you still want to be able to query the individual data stores of your distributed system as quickly as possible.
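Putting the embedded-SQL and free-table points together, a word-count style job could look something like this (shard and table names are hypothetical, and in practice the reduce step would UNION intermediate tables from every worker):

```foxpro
* Hypothetical map step: each worker queries its local shard
* and writes the intermediate result to a free table on disk
SELECT word, COUNT(*) AS cnt ;
    FROM shard042 ;
    GROUP BY word ;
    INTO TABLE intermed042

* Hypothetical reduce step: aggregate the intermediate free tables
SELECT word, SUM(cnt) AS total ;
    FROM intermed042 ;
    GROUP BY word ;
    INTO TABLE final_counts
```

Because INTO TABLE produces a real .dbf on disk, the intermediate results are just files the reduce machine can open over a share - no serialization layer needed.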
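The "bring the code to the data" idea falls out of EXECSCRIPT() almost for free. A minimal sketch of a worker polling a central job table (the table, field names, and status codes are all made up):

```foxpro
* Hypothetical worker loop: pull job source from a shared job table,
* then compile and run it on the fly with EXECSCRIPT()
USE \\jobserver\jobs\jobqueue SHARED
SCAN FOR status = "P"            && "P" = pending
    lcSource = jobqueue.src      && memo field holding the map program
    REPLACE status WITH "R"      && mark running
    EXECSCRIPT(lcSource)         && compile to p-code and execute
    REPLACE status WITH "D"      && mark done
ENDSCAN
USE
```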

Now, there's still a 2GB file size limit for a free table in VFP, but if you're already sharding your data across several physical files anyway, a 2GB cap per file doesn't really matter. The only real technical disadvantage I can think of is that VFP is a 32-bit application, so it can't use more than 4GB of RAM. However, if you're running your map reduce jobs on cheap, commodity machines, perhaps that's not a real problem, since you might not even want more than 4GB of RAM in each machine. Obviously, the one economic disadvantage you can't get around is that each machine running the map process has to be running Windows, which, unlike Linux, ain't free.
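For instance, one trivial way to keep any single free table under the 2GB cap is to route each record to a shard by hashing its key - SYS(2007) gives you a checksum to hash with (the shard count and naming scheme here are just illustrative):

```foxpro
* Hypothetical shard router: hash the record key to one of 16 free tables
LPARAMETERS tcKey
LOCAL lnShard
lnShard = VAL(SYS(2007, tcKey)) % 16   && SYS(2007) returns a checksum string
RETURN "shard" + PADL(TRANSFORM(lnShard), 3, "0") + ".dbf"
```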

What do you all think?