Pentaho

Customer Portal


Pentaho Data Integration


Overview

This page serves as a library for the Pentaho Data Integration (PDI) Best Practices, Guidelines, and Techniques documents. It contains helpful information on the many uses, components, and standards that have been developed to help you get optimal use and performance from PDI.

Contents

  • DevOps with Pentaho 
    • Driving PDI Project Success with DevOps - updated!
    • PDI Project Setup and Lifecycle Management - updated!
    • Continuous Integration with Pentaho Data Integration - updated!
  • Design Standards and Guidelines
    • Design Guidelines for PDI
    • Naming Standards for PDI - updated!
    • Standards for Lookups, Joins, and Subroutines in PDI - updated!
  • How-To Guides
    • Realtime Data Processing with PDI - updated!
    • Fixed Width Data in PDI
    • Guidelines for Metadata Injection in PDI - updated!
    • Restartability in PDI - updated!
  • Working with the PDI Repository
    • Dividing Large Repositories in PDI - updated!
    • Logging and Monitoring in PDI
    • Performance Tuning in PDI
    • R on PDI
    • Pentaho Analyzer with Impala as a Data Source
    • Pentaho and QuickStart VM
    • Using PDI with HCP - new!

The Components Reference in Pentaho Documentation has a complete list of supported software and hardware.

DevOps with Pentaho 

DevOps is a set of practices centered around communication, collaboration, and integration between software development and IT operations teams and automating the processes between them. Its main objective is to reduce the time between committing a solution increment and having that increment available in production, while ensuring high quality.


1. Driving PDI Project Success with DevOps
For versions 7.x, 8.x, 9.0 / published March 2020

This document introduces the Pentaho Data Integration DevOps series: Best Practices documents whose main objective is to provide guidance on creating an automated environment where iteratively building, testing, and releasing a Pentaho Data Integration (PDI) solution can be faster and more reliable, resulting in a high-quality solution that meets customer expectations at a functional and operational level.

Audience: Pentaho administrators, developers, and architects, as well as IT professionals who help plan software development.

2. PDI Project Setup and Lifecycle Management
For versions 7.x, 8.x, 9.0 / published March 2020

Starting your Data Integration (DI) project means planning beyond the data transformation and mapping rules to fulfill your project’s functional requirements. A successful DI project proactively incorporates design elements for a DI solution that not only integrates and transforms your data in the correct way but does so in a controlled manner.

This document serves as a foundation on which you can build your development, CI, and CD strategies. We cover best practices for starting your PDI project, including project structure and the project’s technical governance and lifecycle management strategy.

Audience: Pentaho developers or anyone who is interested in setting up and improving PDI projects.


3. Continuous Integration with Pentaho Data Integration 
For versions 7.x, 8.x, 9.0 / published March 2020

This document introduces the foundations of Continuous Integration (CI) for your Pentaho Data Integration (PDI) project. It is the third document in the PDI DevOps series, and provides examples and instructions geared toward a situation where you are using Git as a code repository, Jenkins as an automation server, and JUnit as the test framework. You can use this as a model to build your own configuration using the same principles found throughout the series.

Audience: Pentaho administrators, developers, and architects, as well as IT professionals who help plan software development.


Design Standards and Guidelines

Design Guidelines for PDI
For versions 6.x, 7.x, 8.0 / published December 2017

PDI Design Guidelines will show you how to create transformations and jobs for maximum speed, reuse, portability, maintainability, debugging, and knowledge transfer.

Audience: Pentaho ETL developers and architects, or anyone who is interested in learning PDI development patterns.

Naming Standards for PDI
For versions 7.x, 8.x, 9.0 / published March 2020

This document contains step naming standards for PDI and guidance on choosing the appropriate steps in certain situations.

Audience: PDI users or anyone with a background in ETL development who is interested in learning PDI development patterns.

Standards for Lookups, Joins, and Subroutines in PDI
For versions 7.x, 8.x, 9.0 / published March 2020

This document covers best practices for lookups, joins, and subroutines in PDI.

Audience: PDI users or anyone with a background in ETL development who is interested in learning PDI development patterns.

 

How-To Guides

Realtime Data Processing with PDI
For versions 8.x, 9.0 / published March 2020

This is part of a webinar covering best practices on designing and building your PDI transformations and jobs for maximum speed, reuse, portability, and debugging.

Audience: Solution architects and designers, or anyone with a background in real-time data ingestion or messaging systems like JMS.

 
Fixed Width Data in PDI
For versions 6.x, 7.x, and 8.0 / published December 2017

PDI offers the Fixed File Input step for reading fixed-width text files. On the output side, there is no step dedicated to this specific purpose, but fixed-width text can still be written using the existing Text file output step. This document walks you through the changes you will need to make to the default column metadata to successfully accomplish this task.

Audience: Data analysts and ETL developers who need to write fixed-width data.
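
To illustrate the idea behind that document, here is a minimal, hypothetical sketch (in Python, not PDI itself) of what the Text file output step effectively does once each column's metadata is given a fixed length and padding: every field is padded or truncated to its configured width. The field names and widths below are invented for illustration.

```python
# Conceptual sketch of fixed-width output. In PDI this is configured through
# column metadata (length, padding, alignment) on the Text file output step;
# the widths and sample row here are assumptions for illustration only.

def to_fixed_width(row, widths):
    """Render one row as fixed-width text: truncate overlong values,
    left-align and space-pad the rest."""
    fields = []
    for value, width in zip(row, widths):
        text = str(value)[:width]      # truncate values longer than the column
        fields.append(text.ljust(width))  # left-align, pad with spaces
    return "".join(fields)

widths = [10, 5, 8]                    # assumed column widths
line = to_fixed_width(["ACME", 42, "2020-03"], widths)
```

The same result in PDI comes from setting each output field's length and trim/padding options rather than writing code.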

Guidelines for Metadata Injection in PDI
For versions 7.x, 8.x, 9.0 / published March 2020

Here, you will find best practices for using template-driven designs and for navigating and operating the different levels of metadata injection. We have provided an example of how to build a data-driven, rule-based extract/transform/load (ETL) transformation and make it flexible enough that rules can be added, changed, or removed without adding development cycles.

Audience: Customers in need of tips for using template-driven designs, and for building the template-driven ETL transformation.
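
As a conceptual analogue of metadata injection, the sketch below (plain Python, not PDI) shows a generic "template" transformation whose field mapping is supplied at run time as metadata, so a new source can be accommodated by changing the metadata rather than the template. The mapping and sample rows are invented for illustration.

```python
# Conceptual sketch of a template-driven, metadata-injected transformation.
# The "template" is generic; the field mapping is injected at run time.
# Mapping and sample data below are assumptions for illustration only.

def run_template(rows, mapping):
    """Apply a metadata-driven select/rename to each input row (a dict):
    for every target field, pull the configured source field."""
    return [{target: row[source] for target, source in mapping.items()}
            for row in rows]

# Metadata that would be "injected" into the template transformation:
mapping = {"customer_id": "id", "customer_name": "name"}

source_rows = [{"id": 1, "name": "ACME", "region": "EU"}]
result = run_template(source_rows, mapping)
```

In PDI the same pattern uses the ETL metadata injection step to push field definitions into a template transformation instead of hard-coding them.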

Restartability in PDI
For versions 7.x, 8.x, 9.0 / published February 2020

Restartability architecture in PDI jobs and transformations plays a key role in restarting a failed ETL process where it left off. This is true whether you need to avoid duplicate entries in the target database, or you are simply seeking overall ETL efficiency and do not want to rerun processes that completed successfully in the previous run.

Audience: Pentaho ETL developers and architects, or anyone who is interested in learning PDI development patterns.
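
The core idea of restartability can be sketched as follows (a hypothetical Python illustration, not PDI): progress is checkpointed after each unit of work, so a rerun skips work that already succeeded. In a real PDI solution the checkpoint would typically live in a database table, not an in-memory set.

```python
# Minimal sketch of a restartable batch. The checkpoint records completed
# items so a rerun after failure resumes where the previous run left off.
# The items and no-op processing function are assumptions for illustration.

def run_batch(items, process, checkpoint):
    """Process only items not yet in the checkpoint; record each success
    before moving on, so a crash loses at most the in-flight item."""
    processed = []
    for item in items:
        if item in checkpoint:
            continue                   # already done in a previous run
        process(item)
        checkpoint.add(item)           # commit progress immediately
        processed.append(item)
    return processed

checkpoint = {"a"}                     # "a" completed before a prior failure
done = run_batch(["a", "b", "c"], lambda item: None, checkpoint)
```

This also avoids duplicate entries in the target, since completed items are never reprocessed.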

 

Working with the PDI Repository

Dividing Large Repositories in PDI
For versions 6.x, 7.x, 8.x, 9.0 / published March 2020

This document addresses performance and management problems by detailing an automated method for improving performance: segmenting a single large repository into several smaller repositories using PDI.

Audience: Customers looking to create a single PDI repository for maintaining many environments.

Logging and Monitoring in PDI
For versions 6.x, 7.x, 8.x / published February 2018

This document describes the different options and best practices for PDI logging. Topics include reasons for using PDI logging, logging levels, transformation and job logging, and debugging transformations and jobs.

Audience: Customers or developers who may be interested in using PDI logging.

Performance Tuning in PDI
For versions 5.4, 6.x, 7.x / published August 2017

This guide provides an overview of factors that can affect the performance of PDI jobs and transformations, and it provides a methodical approach to identifying and addressing bottlenecks.

Audience: Customers who seek assistance with finding and fixing bottlenecks in their PDI performance.
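
The measure-first approach that guide advocates can be illustrated with a small, hypothetical sketch: time each stage of a pipeline separately to see where throughput drops. The stage names and toy workloads below are invented; PDI itself reports per-step row counts and timings in its execution metrics.

```python
# Rough sketch of bottleneck hunting: measure each pipeline stage in
# isolation and compare elapsed times. Stages here are toy functions
# standing in for real transformation steps (assumptions for illustration).
import time

def timed(stage, rows):
    """Run one stage over all rows and return (output, elapsed seconds)."""
    start = time.perf_counter()
    out = [stage(row) for row in rows]
    return out, time.perf_counter() - start

rows = list(range(10_000))
out1, t_clean = timed(lambda r: r * 2, rows)            # cheap "clean" stage
out2, t_lookup = timed(lambda r: sum(range(50)), out1)  # costlier "lookup" stage

slowest = "lookup" if t_lookup > t_clean else "clean"
```

Once the slowest stage is identified, tuning effort (caching, indexing, parallel copies) can be focused there instead of guessed at.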

R on PDI
For versions 6.x, 7.x, 8.0 / published December 2017

This document covers some best practices on integrating R with PDI, including how to install and use R with PDI and why you would want to use this setup.

Audience: Data analysts, data scientists, and PDI users who need to use the variety of statistical and machine learning tools available in the R environment.

Pentaho Analyzer with Impala as a Data Source
For versions 6.x, 7.x, 8.0 / published February 2018

This is a collection of best practices on using Pentaho Analyzer with Impala data sources, including how to prepare and partition data and set configurations. You will also learn about schema recommendations and settings for Analyzer.

Audience: Pentaho developers, system administrators, and architects.

Pentaho and QuickStart VM
For versions 7.x, 8.x / published May 2018

This document covers some best practices on integrating Pentaho software with Cloudera QuickStart VM, including how to configure the QuickStart VM so that Pentaho can connect to it.

Audience: Pentaho developers and system architects looking to experiment with PDI and Hadoop.

Using PDI with HCP - new!
For version 8.3 / published November 2019

PDI can be used to move objects to and from Hitachi Content Platform (HCP). Starting in PDI 8.3, we have introduced native steps that give you the ability to query HCP objects as well as read and write object metadata, allowing you to create ETL solutions for batch file processing directly in the HCP environment.

Audience: ETL developers, BI developers, storage administrators, or anyone with a background in database development, BI, or IT storage who is interested in working with files stored in HCP using PDI.
