
Big Data and Pentaho

Your feedback is important to us! Email us to let us know how we can improve these documents.

Overview

This page serves as a library for the Pentaho Big Data Best Practices, Guidelines, and Techniques documents. You will find information to guide you through the uses, components, and standards that have been put in place to help you maximize performance.

Contents

  • Big Data Ingestion Patterns
  • Deploying Custom Step Plugins for Pentaho MapReduce 
  • Transformation Variables in Pentaho MapReduce
  • Configuring PDI, Pentaho MapReduce, and MapR
  • Big Data On-Cluster Processing with Pentaho MapReduce
  • Parsing XML on PDI
  • R on Pentaho Data Integration (PDI)
  • Getting Started with Pentaho and Cloudera QuickStart VM
  • Pentaho Analyzer with Impala as a Data Source

The Components Reference in Pentaho Documentation has a complete list of supported software and hardware.

Big Data Best Practices and Guidelines

Big Data Ingestion Patterns
For versions 7.x, 8.x / published August 2019

The ways in which data can be set up, saved, accessed, and manipulated are extensive and varied. When setting up your data, choosing the right file format requires careful thought. Topics discussed here include data ingestion patterns, the data lake concept, and choosing a suitable file format.

Audience: Pentaho ETL developers and architects interested in learning data ingestion patterns, or anyone needing to bulk load data, change a schema, store intermediate data, or optimize query speed.

Deploying Custom Step Plugins for Pentaho MapReduce
For versions 6.x, 7.x, 8.x / published September 2018

Pentaho MapReduce (PMR) allows ETL developers to design and execute transformations that run in Hadoop MapReduce. If one or more custom steps are used in the mapper, reducer, or combiner transformations, additional configuration will be needed to make sure all dependencies are available for use in the Hadoop environment.
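One common way to satisfy those dependencies is to list the custom plugin folders in the big data plugin's plugin.properties file so that PMR stages them on the cluster along with the Kettle engine. A minimal sketch, using property names from the pentaho-big-data-plugin configuration file and a hypothetical steps/MyCustomStep plugin folder:

    # data-integration/plugins/pentaho-big-data-plugin/plugin.properties

    # HDFS directory where the Kettle engine is staged for Pentaho MapReduce
    pmr.kettle.dfs.install.dir=/opt/pentaho/mapreduce

    # Comma-separated list of additional PDI plugin folders to copy to the
    # cluster alongside the engine (hypothetical custom step plugin)
    pmr.kettle.additional.plugins=steps/MyCustomStep

After editing these properties, you may need to remove the previously staged installation directory on HDFS (or change pmr.kettle.installation.id) so that PMR re-copies the engine and plugins on the next run.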

Audience: Pentaho Data Integration (PDI) developers or administrators configuring PMR on a Hadoop cluster.

Transformation Variables in Pentaho MapReduce
For versions 6.x, 7.x, 8.0 / published September 2018

Pentaho MapReduce (PMR) allows ETL developers to design and execute transformations that run in Hadoop MapReduce. When PMR jobs are run, additional variables are injected into the Kettle variable space and can be used to enhance the transformations that map, combine, or reduce. The document also includes examples of using the TaskID variable.
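For example, an injected variable can be read from a scripting step inside the mapper transformation. A minimal sketch for the Modified Java Script Value step, assuming Internal.Hadoop.TaskId is the injected task-ID variable described in the document's TaskID examples:

    // Read the task ID injected by PMR; getVariable() is the standard
    // Kettle JavaScript helper (the second argument is a default value).
    var taskId = getVariable("Internal.Hadoop.TaskId", "unknown");

    // Hypothetical use: tag each output key with the task that produced
    // it, which helps when debugging skewed or failing mapper tasks.
    var taggedKey = taskId + "-" + key;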

Audience: Pentaho Data Integration (PDI) developers or administrators configuring PMR on a Hadoop cluster.

Configuring PDI, Pentaho MapReduce, and MapR
For versions 7.x, 8.x / published June 2019

This document provides insight and best practices for setting up Pentaho Data Integration (PDI) to work with MapR, including information about installing and setting up the MapR client tool that PDI requires to run Pentaho MapReduce jobs. It also discusses setting up environments.

Certain configurations in the Hadoop ecosystem are examined to make sure that the client is correctly set up before PDI uses it.
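As a rough sketch of that client-side setup on a Linux PDI node, the MapR client is pointed at the cluster with MapR's configure.sh script and then verified with a simple file system command (the cluster name, CLDB host, and port below are placeholders for your site):

    # Configure the MapR client in client-only mode (-c) against the
    # cluster's CLDB node; all values here are site-specific placeholders.
    /opt/mapr/server/configure.sh -N my.cluster.com -c -C cldb-host:7222

    # Sanity check: list the root of the cluster file system from the client
    hadoop fs -ls /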

Audience: Cluster or server administrators, especially Hadoop administrators.

Big Data On-Cluster Processing with Pentaho MapReduce
For versions 7.x, 8.x / published May 2019

Pentaho Data Integration (PDI) includes multiple functions for pushing work to the cluster, taking advantage of distributed processing and data locality. This document covers best practices for pushing ETL processes to Hadoop-based implementations. MapReduce and YARN are also covered here.

Audience: Cluster or server administrators, solution architects, or anyone with a background in big data processing. 

Parsing XML on PDI
For versions 7.x, 8.x / published April 2019

There are different techniques to process and parse Extensible Markup Language (XML) files stored in a Hadoop cluster. This best practice focuses on selecting and implementing the best strategy to parse XML for your use case.

Audience: Cluster or server administrators, solution architects, or anyone with a background in big data processing. 

R on PDI
For versions 6.x, 7.x, 8.0 / published December 2017

This document covers some best practices for integrating R with PDI, including how to install and use R within PDI and why you might want this setup.

Audience: Data analysts, data scientists, and PDI users who need to use the variety of statistical and machine learning tools available in the R environment.

Getting Started with Pentaho and Cloudera QuickStart VM
For versions 7.x, 8.x / published May 2018

This document covers some best practices on integrating Pentaho software with Cloudera QuickStart VM, including how to configure the QuickStart VM so that Pentaho can connect to it.
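As one small example of that setup, the QuickStart VM identifies itself by the hostname quickstart.cloudera, so the machine running PDI needs to be able to resolve that name. A minimal sketch, where the IP address is a placeholder for your VM's actual address:

    # /etc/hosts on the machine running PDI (Spoon)
    192.168.56.101   quickstart.cloudera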

Audience: Pentaho developers and system architects looking to experiment with PDI and Hadoop.

Pentaho Analyzer with Impala as a Data Source
For versions 6.x, 7.x, 8.0 / published February 2018

This is a collection of best practices for using Pentaho Analyzer with Impala data sources, including how to prepare and partition data and set configurations. You will also learn about schema recommendations, settings for Analyzer and Metadata, and the design and use of partitions.
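To illustrate the data-preparation side, a typical approach is to store fact data as Parquet and partition it on the columns that Analyzer reports filter on most often. A hypothetical Impala DDL sketch (table and column names are illustrative only):

    -- Hypothetical sales fact table: Parquet storage with year/month
    -- partitions so Impala can prune partitions on common date filters.
    CREATE TABLE sales (
      order_id    BIGINT,
      customer_id BIGINT,
      amount      DECIMAL(12,2)
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET;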

Audience: Pentaho developers, system administrators, and architects.
