Skip to content

Connect Cloudgene to a Hadoop Cluster

When you execute workflows with Hadoop steps, Cloudgene needs to know where the Namenode and Jobtracker are. These information are in the Hadoop config files in $HADOOP_HOME/conf on your Namenode. If you start Cloudgene on the Namenode then all configuration files are already located on the machine.

Copy the Configuration Files from Cluster to Cloudgene

When you plan to run Cloudgene on a different server than the Namenode, then you have to copy the confioguration files from the cluster to your Cloudgene instance.

Copy the following configuration files from the $HADOOP_HOME/conf on your Namenode to a directory on your Cloudgene instance (e.g. /path/to/cloudgene/hadoop-conf/):

  • core-site.xml
  • hbase-site.xml
  • hdfs-site.xml
  • hive-site.xml
  • mapred-site.xml
  • yarn-site.xml

If you are using Cloudera Manager, you can download the configuration of the MapReduce services directly via the web interface. A step by step guide can be found in this article.

Configure Cluster Connection

Next, you have to define the cluster and the path to its configuration files in your config/settings.yaml file:

  # cluster name
  name: Test Cluster
  # the cluster type. Currently, the only cluster supported is hadoop.
  type: hadoop
  # configuration files from your cluster
  conf: /path/to/cloudgene/hadoop-conf
  # username which should be used for job execution
  user: hadoop

Validate Cluster Connection

You can use the verify-cluster command to check if Cloudgene is able to establish a connection to your cluster:

cloudgene verify-cluster

You should see a message like Hadoop cluster is ready to use.

Next, restart the Cloudgene webserver, open the Admin Panel and click on the Server tab. In the section Resources you will see if Cloudgene was able to establish a connection to your Hadoop cluster as well as all nodes were found: