This page serves as a library of the Pentaho Data Integration (PDI) Best Practices, Guidelines, and Techniques documents. These documents cover the uses, components, and standards developed to help you get the best use and performance from PDI.
- Design Standards and Guidelines
- PDI Project Setup and Lifecycle Management
- Design Guidelines for PDI
- Naming Standards for PDI
- Standards for Lookups, Joins, and Subroutines in PDI
- How-To Guides
- Realtime Data Processing with PDI
- Fixed Width Data in PDI
- Guidelines for Metadata Injection in PDI
- Restartability in PDI
- Working with the PDI Repository
- Dividing Large Repositories in PDI
- Logging and Monitoring in PDI
- Performance Tuning in PDI
- R on PDI
- Pentaho Analyzer with Impala as a Data Source
- Pentaho and QuickStart VM
The Components Reference in Pentaho Documentation has a complete list of supported software and hardware.
Design Standards and Guidelines
Realtime Data Processing with PDI
For version 8.x / published May 2019
This is part of a webinar covering best practices on designing and building your PDI transformations and jobs for maximum speed, reuse, portability, and debugging.
Audience: Solution architects and designers, or anyone with a background in real-time data ingestion or messaging systems such as Java Message Service (JMS).
Fixed Width Data in PDI
For versions 6.x, 7.x, and 8.0 / published December 2017
PDI offers the Fixed File Input step for reading fixed-width text files. On the output side, there is no step dedicated to this specific purpose, but fixed-width text can still be written using the existing Text file output step. This document walks you through the changes you will need to make to the default column metadata to successfully accomplish this task.
Audience: Data analysts and ETL developers who need to write fixed-width data.
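To make the fixed-width idea above concrete, here is a minimal sketch in plain Python of what writing fixed-width output involves: every field is padded or truncated to a set width, so columns line up without delimiters. It is an illustration only, not how the Text file output step is implemented. The function name `format_fixed_width`, the `(name, width, align)` layout tuples, and the sample data are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical layout entry: (field name, column width, "left" or "right")
Layout = List[Tuple[str, int, str]]


def format_fixed_width(row: Dict[str, object], layout: Layout) -> str:
    """Render one record as a fixed-width line with no delimiters."""
    parts = []
    for name, width, align in layout:
        text = str(row[name])[:width]  # truncate values that exceed the column
        if align == "left":
            parts.append(text.ljust(width))  # pad on the right with spaces
        else:
            parts.append(text.rjust(width))  # pad on the left with spaces
    return "".join(parts)


# Assumed example layout: 5-char id, 10-char name, 8-char amount
layout = [("id", 5, "right"), ("name", 10, "left"), ("amount", 8, "right")]
line = format_fixed_width({"id": 42, "name": "Widget", "amount": "19.99"}, layout)
print(repr(line))
```

In PDI itself, the equivalent settings are made in the step's field metadata (for example length and padding) rather than in code. The referenced document describes the exact changes.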
Guidelines for Metadata Injection in PDI
For versions 6.x, 7.x, 8.0 / published December 2017
Here, you will find best practices for using template-driven designs and for navigating and operating the different levels of metadata injection. We have provided an example of how to build a data-driven, rule-based extract/transform/load (ETL) transformation and make it flexible, so that rules can be added, changed, or removed without adding development cycles.
Audience: Customers in need of tips for using template-driven designs, and for building the template-driven ETL transformation.
Restartability in PDI
For versions 6.x, 7.x, 8.x / published December 2017
Restartability architecture in PDI jobs and transformations plays a key role in restarting a failed ETL process where it left off. This is true whether you need to avoid duplicate entries in the target database, or you are simply seeking overall ETL efficiency and do not want to rerun processes that completed successfully in the previous run.
Audience: Pentaho ETL developers and architects, or anyone who is interested in learning PDI development patterns.
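The restartability idea described above (resume a failed run where it left off and skip work that already succeeded) can be sketched generically as a checkpoint file. This is a minimal plain-Python illustration under stated assumptions, not PDI's own mechanism. The function `run_with_checkpoint`, the JSON checkpoint format, and the step names are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_with_checkpoint(steps: List[Step], checkpoint_path: Path) -> List[str]:
    """Run steps in order, skipping any recorded as completed in an earlier run."""
    done = set()
    if checkpoint_path.exists():
        done = set(json.loads(checkpoint_path.read_text()))

    executed = []
    for name, action in steps:
        if name in done:
            continue  # finished in a previous run; do not repeat it
        action()  # if this raises, progress recorded so far stays on disk
        done.add(name)
        checkpoint_path.write_text(json.dumps(sorted(done)))
        executed.append(name)
    return executed
```

After a failure, calling the function again with the same checkpoint path runs only the steps that had not completed. This is the behavior that prevents duplicate loads into the target. A real process would also clear the checkpoint after a fully successful run.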
Working with the PDI Repository
Dividing Large Repositories in PDI
For versions 6.x, 7.x, 8.0 / published August 2017
This document addresses performance and management problems by detailing an automated method for segmenting a single repository into several smaller repositories using PDI.
Audience: Customers looking to create a single PDI repository for maintaining many environments.
Logging and Monitoring in PDI
For versions 6.x, 7.x, 8.x / published February 2018
This document describes the options and best practices for PDI logging. Topics include reasons for using PDI logging, logging levels, transformation and job logging, and debugging transformations and jobs.
Audience: Customers or developers who may be interested in using PDI logging.
Performance Tuning in PDI
For versions 5.4, 6.x, 7.x / published August 2017
This guide provides an overview of factors that can affect the performance of PDI jobs and transformations, and it provides a methodical approach to identifying and addressing bottlenecks.
Audience: Customers who seek assistance with finding and fixing bottlenecks in their PDI performance.
R on PDI
For versions 6.x, 7.x, 8.0 / published December 2017
This document covers some best practices on integrating R with PDI, including how to install and use R with PDI and why you would want to use this setup.
Audience: Data analysts, data scientists, and PDI users who need to use the variety of statistical and machine learning tools available in the R environment.
Pentaho Analyzer with Impala as a Data Source
For versions 6.x, 7.x, 8.0 / published February 2018
This is a collection of best practices on using Pentaho Analyzer with Impala data sources, including how to prepare and partition data and set configurations. You will also learn about schema recommendations and settings for Analyzer.
Audience: Pentaho developers, system administrators, and architects.
Pentaho and QuickStart VM
For versions 7.x, 8.x / published May 2018
This document covers some best practices on integrating Pentaho software with the Cloudera QuickStart VM, including how to configure the QuickStart VM so that Pentaho can connect to it.
Audience: Pentaho developers and system architects looking to experiment with PDI and Hadoop.