Software

Software available in the BigData@PoliTO laboratory

Each node of the BigData@PoliTO cluster runs a Cloudera distribution on Ubuntu Linux (14.04.02 LTS). The Cloudera distribution is based on the open source Apache Hadoop framework for distributed Big Data applications. The Apache Hadoop ecosystem includes Apache open source projects for data management (YARN and HDFS, the Hadoop Distributed File System), governance and integration (Falcon, Flume and Sqoop), operations (Ambari, Oozie and ZooKeeper), and data access (Hive, Pig, Mahout, Storm and Spark). This is only a partial list, and new projects may be added in the future.


Ubuntu: Debian-based Linux operating system
Cloudera: 100% open source distribution including Apache Hadoop
Apache Hadoop: Framework that allows for the distributed processing of large data sets across clusters of computers
YARN: Framework for job scheduling and cluster resource management
HDFS: Distributed file system that provides high-throughput access to application data
Falcon: Feed management and data processing platform
Flume: Distributed service for collecting, aggregating and moving large amounts of log data
Sqoop: Tool for transferring data between Hadoop and structured datastores such as relational databases
Ambari: Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
Oozie: Workflow scheduler system to manage Apache Hadoop jobs
ZooKeeper: Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
Hive: Data warehouse infrastructure that provides data summarization and ad hoc querying
Pig: High-level data-flow language and execution framework for parallel computation
Mahout: Scalable machine learning and data mining library
Storm: Distributed real-time computation system
Spark: Fast and general compute engine for Hadoop data
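
As a rough illustration of what "distributed processing of large data sets" means in practice, the sketch below mimics the map, shuffle and reduce steps of a word count inside a single Python process. It is only a toy model under stated assumptions: it does not call any Hadoop, YARN or Spark API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names chosen for this example, not part of any software installed on the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, different nodes would map different blocks of input;
    # here everything runs sequentially in one process (an assumption made
    # purely for illustration).
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    sample = ["big data on a cluster", "data on HDFS"]
    print(word_count(sample))
```

On the cluster, the same logic would typically be expressed through one of the tools listed above (for example Spark, Pig or Hive), which take care of splitting the input stored in HDFS and scheduling the work through YARN.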