Your feedback is important to us! Email us to let us know how we can improve these documents.
This page is a library of the Pentaho Data Integration (PDI) Best Practices, Guidelines, and Techniques documents. These documents cover the many uses, components, and standards that have been developed to help you get optimal use and performance from PDI.
- DevOps with Pentaho
- Driving PDI Project Success with DevOps - updated!
- PDI Project Setup and Lifecycle Management - updated!
- Continuous Integration for Pentaho Data Integration - updated!
- Design Standards and Guidelines
- Design Guidelines for PDI
- Naming Standards for PDI - updated!
- Standards for Lookups, Joins, and Subroutines in PDI - updated!
- How-To Guides
- Realtime Data Processing with PDI - updated!
- Fixed Width Data in PDI
- Guidelines for Metadata Injection in PDI - updated!
- Restartability in PDI - updated!
- Working with the PDI Repository
- Dividing Large Repositories in PDI - updated!
- Logging and Monitoring in PDI
- Performance Tuning in PDI
- R on PDI
- Pentaho Analyzer with Impala as a Data Source
- Pentaho and QuickStart VM
- Using PDI with HCP
The Components Reference in Pentaho Documentation has a complete list of supported software and hardware.
DevOps with Pentaho
DevOps is a set of practices centered on communication, collaboration, and integration between software development and IT operations teams, and on automating the processes between them. Its main objective is to reduce the time between committing a solution increment and having that increment available in production, while ensuring high quality.
Design Standards and Guidelines
|Realtime Data Processing with PDI
For versions 8.x, 9.0 / published March 2020
This is part of a webinar covering best practices on designing and building your PDI transformations and jobs for maximum speed, reuse, portability, and debugging.
Audience: Solution architects and designers, or anyone with a background in real-time data ingestion or messaging systems such as JMS.
|Fixed Width Data in PDI
For versions 6.x, 7.x, and 8.0 / published December 2017
PDI offers the Fixed File Input step for reading fixed-width text files. On the output side, there is no step dedicated to this specific purpose, but fixed-width text can still be written using the existing Text file output step. This document walks you through the changes you will need to make to the default column metadata to successfully accomplish this task.
Audience: Data analysts and ETL developers who need to write fixed-width data.
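As a rough illustration of what writing fixed-width data involves, the sketch below pads or truncates each field to a declared width before concatenating the fields into one line. It is plain Python, not PDI: the field layout, the `FIELDS` list, and the `format_fixed_width` helper are hypothetical names chosen for this example, and the column metadata settings for the Text file output step are covered in the document itself.

```python
# Hypothetical field layout: (name, width, alignment).
# These names and widths are assumptions for illustration only.
FIELDS = [("id", 6, "right"), ("name", 12, "left"), ("amount", 10, "right")]


def format_fixed_width(row, fields=FIELDS):
    """Pad or truncate each value to its declared width and concatenate."""
    parts = []
    for name, width, align in fields:
        text = str(row[name])[:width]  # truncate values that are too long
        if align == "right":
            parts.append(text.rjust(width))  # pad on the left with spaces
        else:
            parts.append(text.ljust(width))  # pad on the right with spaces
    return "".join(parts)


line = format_fixed_width({"id": 42, "name": "Widget", "amount": "19.99"})
print(repr(line))
```

Every output line has the same length (28 characters for this layout), which is what lets a fixed-width reader split the line by position alone.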
|Guidelines for Metadata Injection in PDI
For versions 7.x, 8.x, 9.0 / published March 2020
Here, you will find best practices for using template-driven designs and for navigating and operating the different levels of metadata injection. We have provided an example of how to build a data-driven, rule-based extract/transform/load (ETL) transformation and make it flexible, so that rules can be added, changed, or removed without adding development cycles.
Audience: Customers in need of tips for using template-driven designs, and for building the template-driven ETL transformation.
|Restartability in PDI
For versions 7.x, 8.x, 9.0 / published February 2020
Restartability architecture in PDI jobs and transformations plays a key role in restarting a failed ETL process where it left off. This is true whether you need to avoid duplicate entries in the target database, or you are simply seeking overall ETL efficiency and do not want to rerun processes that completed successfully in the previous run.
Audience: Pentaho ETL developers and architects, or anyone who is interested in learning PDI development patterns.
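The idea of restarting "where it left off" can be sketched with a simple checkpoint file. The example below is plain Python rather than PDI, and it does not reproduce the patterns in the document: `run_with_checkpoint`, the step names, and the JSON checkpoint format are all assumptions made for illustration.

```python
import json
import os
import tempfile


def run_with_checkpoint(steps, checkpoint_path):
    """Run named steps in order, skipping any already recorded as completed."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    executed = []
    for name, action in steps:
        if name in done:
            continue  # completed in a previous run, so do not rerun it
        action()  # an exception here leaves the checkpoint at the last success
        done.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # record progress after each successful step
        executed.append(name)
    return executed


# Demo: the first run fails at "load"; the second run resumes from "load".
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
calls = []


def failing_load():
    raise RuntimeError("simulated load failure")


try:
    run_with_checkpoint(
        [("extract", lambda: calls.append("extract")), ("load", failing_load)],
        path,
    )
except RuntimeError:
    pass

resumed = run_with_checkpoint(
    [("extract", lambda: calls.append("extract")),
     ("load", lambda: calls.append("load"))],
    path,
)
```

Because the checkpoint is written only after a step succeeds, the second run skips `extract` and executes only `load`, which avoids both duplicate work and duplicate target rows.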
Working with the PDI Repository
|Dividing Large Repositories in PDI
For versions 6.x, 7.x, 8.x, 9.0 / published March 2020
This document addresses performance and management problems by detailing an automated method for segmenting a single large repository into several smaller repositories using PDI.
Audience: Customers looking to create a single PDI repository for maintaining many environments.
|Logging and Monitoring in PDI
For versions 6.x, 7.x, 8.x / published February 2018
The main objective of this document is to describe best practices and the available options for PDI logging. Topics include reasons for using PDI logging, levels of logging, transformation and job logging, and debugging transformations and jobs.
Audience: Customers or developers who may be interested in using PDI logging.
|Performance Tuning in PDI
For versions 5.4, 6.x, 7.x / published August 2017
This guide provides an overview of factors that can affect the performance of PDI jobs and transformations, and it provides a methodical approach to identifying and addressing bottlenecks.
Audience: Customers who seek assistance with finding and fixing bottlenecks in their PDI performance.
|R on PDI
For versions 6.x, 7.x, 8.0 / published December 2017
This document covers some best practices on integrating R with PDI, including how to install and use R with PDI and why you would want to use this setup.
Audience: Data analysts, data scientists, and PDI users who need to use the variety of statistical and machine learning tools available in the R environment.
|Pentaho Analyzer with Impala as a Data Source
For versions 6.x, 7.x, 8.0 / published February 2018
This is a collection of best practices on using Pentaho Analyzer with Impala data sources, including how to prepare and partition data and set configurations. You will also learn about schema recommendations and settings for Analyzer.
Audience: Pentaho developers, system administrators, and architects.
|Pentaho and QuickStart VM
For versions 7.x, 8.x / published May 2018
This document covers some best practices on integrating Pentaho software with Cloudera QuickStart VM, including how to configure the QuickStart VM so that Pentaho can connect to it.
Audience: Pentaho developers and system architects looking to experiment with PDI and Hadoop.
|Using PDI with HCP - new!
For version 8.3 / published November 2019
PDI can be used to move objects to and from Hitachi Content Platform (HCP). PDI 8.3 introduced native steps that let you query HCP objects and read and write object metadata, so you can create ETL solutions for batch file processing directly in the HCP environment.
Audience: ETL developers, BI developers, storage administrators, or anyone with a background in database development, BI, or IT storage who is interested in working with files stored in HCP using PDI.