Hitachi Vantara Pentaho Customer Portal

Big Data and Pentaho

Overview

This page serves as a library for the Pentaho Big Data Best Practices, Guidelines, and Techniques documents. Each document guides you through the uses, components, and standards that help you get the most out of Pentaho's functionality and performance.

Contents

  • Big Data Ingestion Patterns
  • Configuring PDI, Pentaho MapReduce, and MapR
  • Integrating Pentaho with MapR using Apache Drill
  • Pentaho Big Data On-Cluster Processing
  • Parsing XML on PDI
  • R on Pentaho Data Integration (PDI)
  • Getting Started with Pentaho and Cloudera QuickStart VM

The Components Reference in Pentaho Documentation has a complete list of supported software and hardware.

Big Data Best Practices and Guidelines

Big Data Ingestion Patterns
For versions 7.x, 8.x / published May 2018

The ways in which data can be set up, saved, accessed, and manipulated are extensive and varied. When setting up your data, choosing the format for your files is a process that requires applied thought. Some of the topics discussed here include data ingestion patterns, the data lake concept, and choosing a suitable file format.

Audience: Pentaho ETL developers and architects interested in learning data ingestion patterns, or anyone needing to bulk load data, change a schema, store intermediate data, or optimize query speed.

Configuring PDI, Pentaho MapReduce, and MapR
For versions 6.x, 7.x, 8.0 / published April 2018

This document is intended to provide insight and best practices for setting up Pentaho Data Integration (PDI) to work with MapR, and includes information about setting up and installing the MapR client tool that is required by PDI to run Pentaho MapReduce jobs.

Certain configurations in the Hadoop ecosystem are examined to make sure that the client is correctly set up before PDI uses it.

Audience: Cluster or server administrators, especially Hadoop administrators.

Integrating Pentaho with MapR using Apache Drill
For versions 6.x, 7.x, 8.0 / published April 2018

When you need to do a lot of data integration work, there is some flexibility in how you integrate Pentaho Data Integration (PDI) with MapR. This document explains how to configure Apache Drill for PDI and how to connect PDI to Drill.

Audience: Cluster or server administrators running the MapR Converged Data Platform with Apache Drill.

Pentaho Big Data On-Cluster Processing
For versions 6.x, 7.x / published August 2017

Pentaho Data Integration (PDI) includes multiple functions for pushing work to the cluster, using distributed processing and awareness of data locality. This document covers best practices for pushing ETL processes to Hadoop-based implementations.

Audience: Cluster or server administrators, solution architects, or anyone with a background in big data processing. 

Parsing XML on PDI
For versions 7.x, 8.0 / published February 2018

There are different techniques to process and parse Extensible Markup Language (XML) files stored in a Hadoop cluster. This best practice focuses on selecting and implementing the best strategy to parse XML for your use case.

Audience: Cluster or server administrators, solution architects, or anyone with a background in big data processing. 

R on PDI
For versions 6.x, 7.x, 8.0 / published December 2017

This document covers some best practices on integrating R with PDI, including how to install and use R with PDI and why you would want to use this setup.

Audience: Data analysts, data scientists, and PDI users who need to use the variety of statistical and machine learning tools available in the R environment.

Getting Started with Pentaho and Cloudera QuickStart VM
For versions 7.x, 8.x / published May 2018

This document covers some best practices on integrating Pentaho software with Cloudera QuickStart VM, including how to configure the QuickStart VM so that Pentaho can connect to it.

Audience: Pentaho developers and system architects looking to experiment with PDI and Hadoop.
