Finding Documentation Becomes a Pain

Each time you set up software, you may run into issues that make the task much more difficult. Usually you can counteract them by finding comprehensive documentation that describes the problem and how to fix it. Perhaps the worst experience when trying to deploy software is struggling to find the initial documentation in the first place.

Installing (or attempting to install) Hadoop this semester has given me a greater appreciation for documentation and complete step-by-step instructions. Looking through the Hadoop documentation, I noticed that the software was recently updated and that the documentation describes some of the advantages of the newer version. However, the install instructions seem to be for one of the original versions, and searching for other install instructions hasn’t yielded much help either. I’m sure most of you reading this blog can understand the frustration. Finding the correct way to set up the base system so the different parts of Hadoop work has proven to be anything but a simple task.

At this point the strategy is to continue looking for any documentation that might be useful, combine the pieces I’ve found from different sources to see if something works, and look at contacting people who have set up the new system before. By following this plan and devoting the necessary time to the project, I should get results before the end of the semester.

At this point, my understanding of Hadoop is slowly growing, and as I work I continue to update my own documentation, which hopefully can be used to redeploy the system in the future if necessary.

Semester Update (24 November 2013)

Moving into Thanksgiving week and nearing the end of classes, here is an update on where I stand with the current independent study on moving and setting up servers, and where I plan to go from here.

I’ll start by covering some of the tasks we have accomplished with the computer science servers.

  1. We have successfully installed and set up a new server for computer science
    1. Running on a newer version of VMWare
    2. Server installed on CentOS 6.4
    3. Used to host computer science wiki and blog
  2. Moved over computer science blog and plugins
  3. Changed IP for machines
  4. Git server setup (with Chad)
  5. Installed base systems on computer science cluster

Although we have made good progress on this list, there is still a good amount of work to do in the time we have left this semester. Setting up the base system on the machines in the computer science server room was a good start to building a Hadoop cluster, but the real work lies in setting up the cluster itself and learning how it operates. After installing the cluster on the host machines, I plan on looking into how faculty can use the cluster for examples in class, as well as some good assignment ideas, but that will be dealt with after the cluster itself is set up. The remaining list, in order, looks something like this:

  1. Present on current status of Independent Study
  2. Installing Hadoop cluster on machines
    1. Write and update documentation of the install
  3. Testing for any conflicts with Eucalyptus Cluster (with Chad)
  4. Testing and developing practical understanding of Hadoop cluster
  5. Developing examples/assignments professors can use with the Hadoop cluster

The first item on the list is to present what I’ve been doing this semester at the session the computer science department is holding to go over independent studies and internships (3 December 2013, ST-107). This is a good time to share with other students and faculty what’s been going on this semester and what plans I have for continuing the work.

It may be a bit difficult to finish the planned material before the end of finals (19 December 2013), but I will keep working until the work is complete. I will post more updates as tasks on the list are completed.

What is Hadoop and Why are we Interested

Recently the computer science department at Worcester State chose to add two tracks to the major. One track is software development, which is similar to the current single track, and the other is Big Data Analytics. Because of the new track system, the courses offered had to change, which meant changes to some course materials. Part of my independent study this semester has been to implement a Hadoop cluster, which is a tool that can be used for data analytics.

Part of big data analytics is dealing with very large sets of data. Many times, especially when we think of companies like Google or Amazon, it becomes evident that a single machine, or even a couple of machines, won’t get the job done. This is when distributing tasks across a series of machines works much more efficiently. One software platform designed for this job is Apache Hadoop, which is installed on a cluster of machines so that large data sets, and the jobs that run against them, can be handled much more efficiently.

From the Hadoop website (hadoop.apache.org):

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
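The best known of the "simple programming models" mentioned in the quote is MapReduce. As a rough illustration of the idea only, here is a word count written in plain Python. It does not use Hadoop or any Hadoop API; the function names are made up for this sketch, and everything runs on one machine, whereas Hadoop would spread the map and reduce steps across the nodes of the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, the job the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = []
    for line in lines:
        pairs.extend(map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data cluster data"]))
```

Each call to map_phase only needs its own line of input, which is what makes it possible to hand different pieces of the data to different machines.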

In this independent study, I am looking to implement a cluster running Hadoop and test some applications on it. By the end of the semester, the goal is to have a fully functioning cluster that can be used in the courses that deal with big data analytics, and to have a complete set of install instructions (something that is currently tough to obtain) that can be used to rebuild the Hadoop cluster at Worcester State in the event something goes wrong.

Keep checking for updates about the install, setup, and use of Hadoop here at Worcester State University.

Learning the Hadoop Install

After discussing with Chad Day the possible solutions for setting up both a Hadoop cluster and a Eucalyptus cluster on the same set of machines, we have concluded we should first attempt to run the services side by side. We are planning on using CentOS 6.4 as the operating system on the machines, and then installing the services we need for the nodes. The nine machines we have still carry the operating systems and packages from the 401 class that focused on working with Eucalyptus. Because of this, Chad and I will install the latest version of CentOS, 6.4, on the machines and install the necessary packages for the servers.

Before attempting to install the Hadoop software on the CS machines, I am planning to set up a small environment of Hadoop servers on my personal machine using VirtualBox. This gives me a little more flexibility to play with packages and will be easier to access as I am learning the install process and how the head node and data nodes work together. Once I am confident with the install process on the virtual machines also running CentOS 6.4, I will install the packages on the machines that will have both Hadoop and Eucalyptus.

As for the base operating system on the nine machines, Chad and I will install CentOS this week, hoping to have the base system on all the machines by the end of the upcoming week. My plan is to have a few pieces of install media, either several CDs or USB drives with the CentOS image on them, to help the install go a little quicker. After all the machines are set up and I’ve played around in the virtual environment, I will be ready to install the cluster that will be used in the computer science department.

Successful Move of the Blog and Wiki

I am pleased to report a seamless move to the new wiki and blog server for the Worcester State University Computer Science Department! After collaborating with Dr. Karl R. Wurst and Chad Day, we moved the IP of the old wiki and blog server to the new server and assigned another IP to the old one. After running tests on both the wiki and the blog, cs.worcester.edu/wiki and cs.worcester.edu/blog are working as expected.

At this point in the semester, the blog and wiki have been migrated to an updated operating system. However, we are still looking into ways to secure the wiki from potential spammers. One option is to change the code to limit email registration to the worcester.edu domain, which Chad will look into. The git server is up and running, and the CS department is in the process of testing it for use in classes, both for homework submission and as a way to teach version control early in the major.

Looking forward, Chad and I will be building two clusters with the Computer Science machines, a Eucalyptus cluster and a Hadoop cluster, to be used throughout the semester. Before we install the software on the machines, we must investigate the best solution for sharing hardware. We currently have nine machines at our disposal, and we must partition them for the clusters. The possibilities we have investigated so far include:

  1. Splitting the physical allotment in half: 4 machines for each cluster, with one machine left over
  2. Installing the correct services alongside each other, so that both clusters run on all nine machines and share resources
  3. Installing a virtual host on the machines and running two virtual machines on each host, so that each cluster would have 9 machines and share resources but would remain independent of the other
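To compare the three options by node count, here is a small back-of-the-envelope sketch. The function and key names are invented for illustration, and the third option assumes exactly two virtual machines per physical host, one for each cluster.

```python
MACHINES = 9       # physical machines available
VMS_PER_HOST = 2   # assumption for option 3: one VM per cluster on each host


def split_in_half(machines):
    """Option 1: dedicate half of the physical machines to each cluster."""
    per_cluster = machines // 2
    return {"hadoop": per_cluster,
            "eucalyptus": per_cluster,
            "spare": machines - 2 * per_cluster}


def side_by_side(machines):
    """Option 2: both sets of services run on every physical machine."""
    return {"hadoop": machines, "eucalyptus": machines, "spare": 0}


def virtualized(machines, vms_per_host):
    """Option 3: each host runs several VMs, divided evenly between clusters."""
    total_vms = machines * vms_per_host
    per_cluster = total_vms // 2
    return {"hadoop": per_cluster,
            "eucalyptus": per_cluster,
            "spare": total_vms - 2 * per_cluster}


if __name__ == "__main__":
    print("split in half:", split_in_half(MACHINES))
    print("side by side: ", side_by_side(MACHINES))
    print("virtualized:  ", virtualized(MACHINES, VMS_PER_HOST))
```

The counts only show how many nodes each cluster would see. In the second and third options the nodes of both clusters still compete for the same physical CPU, memory, and disk.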

We should have a strategy and start formatting the host machines by the end of next week, followed by another update.

Migration of cs.worcester.edu/blog

One of the tasks for this fall semester for the server migration/creation study was to move the Worcester State Computer Science blog to a new server alongside a new wiki platform. For both servers, WordPress will be used to manage the blogs, and all posts will be aggregated from other sites. In moving the blog to a new server the following steps were followed:

  1. Investigate the structure of the original server
    1. Web directory stored in a separate lampp folder
    2. MySQL server run from lampp folder
  2. Choose a structure for the new server
    1. Use a standard install of MySQL and Apache
    2. Set up Apache configuration that will work when the IP is changed
    3. Create a MySQL database that matches old server
    4. Create a MySQL user with the same credentials and permissions as the old server
  3. Move the old web directory to the new server
  4. Move the database to the new server
    1. Make a database dump of the WordPress database
    2. Move dump to new server and restore to newly created database
  5. Check that web directory works as expected with migrated database settings
  6. Find the admin and home URL settings in the database and update them directly in the database
    1. This allows you to use the admin page on the new server instead of redirecting to the old admin page
    2. Change any necessary settings on the new page
  7. Run a final test to make sure everything is up to date
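The database part of the steps above (dump, restore, and the URL update) can be sketched roughly as follows. This is not the exact procedure that was run. The database name, user, and dump file name are hypothetical, the http:// scheme is assumed, and the table name assumes WordPress's default wp_ prefix. The script only builds and prints the commands and the SQL statement, so nothing is executed against a real server.

```python
import shlex


def dump_command(db_name, db_user, dump_path):
    """Build the mysqldump call that writes the WordPress database to a file."""
    # --password with no value makes the client prompt for the password.
    return ["mysqldump", f"--user={db_user}", "--password",
            f"--result-file={dump_path}", db_name]


def restore_command(db_name, db_user):
    """Build the mysql call for the new server; the dump file is fed on stdin."""
    return ["mysql", f"--user={db_user}", "--password", db_name]


def url_update_sql(new_url, table_prefix="wp_"):
    """Build the statement that points WordPress at the new address.

    WordPress keeps its admin and home URLs in the options table under
    the option names 'siteurl' and 'home'.
    """
    escaped = new_url.replace("'", "''")  # minimal quoting for a SQL string literal
    return (f"UPDATE {table_prefix}options SET option_value = '{escaped}' "
            "WHERE option_name IN ('siteurl', 'home');")


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    db_name, db_user, dump_path = "wordpress", "wp_user", "blog_dump.sql"
    print(shlex.join(dump_command(db_name, db_user, dump_path)))
    print(shlex.join(restore_command(db_name, db_user)) + " < " + dump_path)
    print(url_update_sql("http://cs.worcester.edu/blog"))
```

Updating the two options in the database first is what lets the admin page load from the new server instead of redirecting to the old one.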

Following the steps above, the migration of the server was straightforward, and the new server was about ready for an IP change. The only issue with the migration was that updates to the pages (e.g. Order of the Rubber Duck) did not carry over. After investigating the database for possible URL conflicts, a few URL settings were changed, and as soon as WordPress ran a scheduled check, the pages were restored.

A final check of the blogs showed that navigation and aggregation are all set, and the IP address is scheduled to be updated to match cs.worcester.edu tomorrow, October 25, 2013. After the IP has been migrated, a thorough check will be made, and if everything is a success, the new server will be used for the blogs.

Blogging for Updates

Attending my final semester at Worcester State University in computer science has given me a great opportunity to be in an independent study where I can work on something that will benefit the program. Throughout the semester I will move the current CS blog from an outdated operating system to the latest CentOS, help set up a git server using GitLab, and research and implement a Hadoop cluster.

The expectation this semester is that the work being done will be used in the computer science department and will help courses offered in the future. Writing solid documentation throughout the semester will make maintaining, and possibly duplicating, the results as easy as possible. In order to keep the computer science department in the loop, I will be blogging weekly with the current status of the project, problems encountered, cool bits of information I’ve come across, and research finds.

This semester allows me to stay active in the computer science department while finishing up my final liberal arts requirements. It should be both a fun and productive semester, and in the end there should be some solid results.