These recommendations are applicable to the following versions of products:
This document presents best practices around the installation of Pentaho's server and client products on Microsoft Azure, and gives an overview of the server, network, and storage architecture recommended to run Pentaho.
Pentaho products are accompanied by detailed Installation Instructions that explain how to configure servers, storage, and databases to host the Pentaho products. However, these installation instructions do not capture the data center environments in which the hardware and storage might be hosted. These best practices complement the product installation instructions by showing how the server and database configurations can be applied specifically in an Azure environment.
The primary focus of this document is the infrastructure architecture to support deployment of Pentaho's server and client products on Azure. The document does not focus on integrating Pentaho with Azure services such as Azure Active Directory, HDInsight, Azure Machine Learning, or other similar services.
Pentaho Components Overview
The Pentaho Components Reference contains details about software versions and configurations required for the Pentaho product suite, which consists of three categories of product:
|Pentaho Server||Installed on server equipment and runs unattended, typically 24x7. It executes Pentaho jobs and stores shared definitions of data stores, reports, and so on. Users typically do not log into the server directly.|
|Pentaho Server Repository||Database used by Pentaho Server to store the definitions and context, and to preserve the state, that the server needs to run.|
|Pentaho Client Tools||Installed on client equipment like desktops or laptops, where users can interact directly with the tools, allowing users to create reports, define scheduled jobs, and perform other actions. Client tools can also interact with the server to do tasks like store report definitions or schedule background jobs.|
Best Practices and Recommendations for Azure
The Pentaho Server, Server Repository, and Client Tools can be deployed to Microsoft Azure. For such a deployment, we recommend the best practices on the following topics:
Running in Azure Best Practices
Although there are several ways to set up Pentaho products with Azure, these recommendations are intended to maximize usability and efficiency.
Recommendation 1: Run Pentaho Server and Server Repository in Azure
If you are already using Azure, and all the data sources used by Pentaho are in Azure, then:
- The Pentaho Server should be deployed to Azure in a virtual machine (VM), and
- The associated Pentaho Server Repository database must also be deployed to an Azure virtual machine.
Recommendation 2: Do Not Run Pentaho Client Tools in Azure
Users interact directly with the Pentaho Client tools, usually with a graphical user interface. Without local installation of the Pentaho client, the user may experience poor performance caused by network latency.
When the user's workstation is not in Azure (which is typically the case), then the Client tools should be installed either on the user's workstation itself, or on a server that has minimal network latency to the user's workstation.
Virtual Network Best Practices
We recommend you have a network architect design the logical network within which Pentaho will be hosted. The design should span Azure and your own locations, and should account for scale and security so that it complies with your technology standards and business requirements.
Recommendation 1: Run Pentaho Products in a Separate VNet
Host the Pentaho products in their own virtual networks (VNets), separated by development life-cycle stage. An example layout would be:
- VNet1: Pentaho Production
- VNet2: Pentaho Development
- VNet3: Customer Applications
This delivers and supports:
- Separation of concerns
- Separation of duties
- Simplified firewall design
NOTE: If the VNets are in the same region, we recommend you peer the Pentaho VNet with each VNet that contains a data source or application with which Pentaho needs to communicate. If the VNets are in different regions, use VNet-to-VNet connections instead of peering.
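The rule in the note above can be sketched as a small decision helper. This is only an illustration: the VNet names, the regions, and the `connection_type` function are hypothetical and are not part of any Azure or Pentaho API.

```python
def connection_type(region_a: str, region_b: str) -> str:
    """Return the connection style suggested by the note above."""
    same_region = region_a.strip().lower() == region_b.strip().lower()
    return "vnet-peering" if same_region else "vnet-to-vnet"


# Hypothetical VNets (name -> region), used only for illustration.
vnets = {
    "pentaho-prod": "eastus",
    "customer-apps": "eastus",
    "data-warehouse": "westeurope",
}

# Decide how the Pentaho VNet should connect to every other VNet.
plan = {
    name: connection_type(vnets["pentaho-prod"], region)
    for name, region in vnets.items()
    if name != "pentaho-prod"
}
print(plan)
```

A helper like this could be run against an inventory of VNets during network design to flag which connections need peering and which need VNet-to-VNet gateways.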
Recommendation 2: Extend Corporate WAN to Include Azure
We recommend creating a virtual network between the corporate wide area network (WAN) and Azure if Pentaho needs access to any resources on the WAN. Although Pentaho traffic can simply be routed over the public internet, doing so has the following disadvantages:
|Feature||Disadvantage(s) to Using Public Internet|
|Service Level Agreements (SLAs)||There is no guarantee of bandwidth availability, low latency, or connectivity.|
|Security||Not all protocols used during data transfer are encrypted.|
|Complexity||All customer sites and all applicable endpoint devices must be configured to talk over the internet to Pentaho in Azure. This includes routing and firewall configuration and maintenance.|
|Throughput||Bandwidth and latency are limited by the internet.|
There are two methods to connect the corporate WAN with Azure:
- Site-to-site virtual private network (VPN) connection: This addresses the security concern by ensuring that all traffic between Pentaho and the corporate WAN is IPsec-encrypted.
- ExpressRoute: This addresses the SLA, security, and throughput concerns. If it is designed correctly, it also addresses complexity concerns.
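The trade-off between the two methods can be written down as a lookup, which may help when comparing them against your own requirements. This is a minimal sketch: the concern labels and the `options_for` helper are hypothetical, and complexity is deliberately left out because the text above says ExpressRoute addresses it only when designed correctly.

```python
# Concerns each connection method addresses, as described in the list above.
# "complexity" is omitted because it depends on the quality of the design.
CONCERNS_ADDRESSED = {
    "site-to-site VPN": {"security"},
    "ExpressRoute": {"sla", "security", "throughput"},
}


def options_for(required):
    """Return the methods that address every concern in `required`."""
    required = set(required)
    return sorted(
        name
        for name, covered in CONCERNS_ADDRESSED.items()
        if required <= covered
    )


print(options_for({"security"}))
print(options_for({"security", "throughput"}))
```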
Recommendation 3: Network Security Groups and Subnets
For management simplicity and security transparency, subnet Network Security Groups (NSGs) should be used instead of virtual machine access control lists (ACLs) to control traffic to and from the Pentaho servers. This is also a general Azure best practice.
Place the Pentaho server (or servers if deployed active/active) in a separate NSG and subnet from the Pentaho repository. Open only the minimum number of ports, depending on the protocols being used to communicate with customer applications and data sources, and from the Pentaho server to the repository database. Which specific port numbers to open depends on your database product.
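The minimum-ports approach can be modeled as an ordered rule list, as a way to reason about the design before building it in Azure. The sketch below is not an Azure SDK call: the `Rule` class, the subnet ranges, and the rule names are hypothetical, and port 5432 is only an example (the PostgreSQL default). Substitute the port your repository database uses. The rule priorities follow the NSG convention that lower numbers are evaluated first.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int          # lower number = evaluated first
    source: str            # subnet CIDR or "*" for any
    destination: str       # subnet CIDR or "*" for any
    port: Optional[int]    # None means any port
    access: str            # "Allow" or "Deny"


def repository_subnet_rules(server_subnet: str, repo_subnet: str,
                            db_port: int) -> List[Rule]:
    """Allow only Pentaho Server -> repository traffic on the database port."""
    return [
        Rule("allow-pentaho-to-repo", 100, server_subnet, repo_subnet,
             db_port, "Allow"),
        Rule("deny-all-other-inbound", 4096, "*", repo_subnet, None, "Deny"),
    ]


def is_allowed(rules: List[Rule], source: str, destination: str,
               port: int) -> bool:
    """Apply the first matching rule in priority order; deny if none match."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (rule.source in ("*", source)
                and rule.destination in ("*", destination)
                and rule.port in (None, port)):
            return rule.access == "Allow"
    return False


# Hypothetical subnets for the Pentaho Server and the repository database.
rules = repository_subnet_rules("10.0.1.0/24", "10.0.2.0/24", 5432)
print(is_allowed(rules, "10.0.1.0/24", "10.0.2.0/24", 5432))  # database port
print(is_allowed(rules, "10.0.1.0/24", "10.0.2.0/24", 22))    # any other port
```

Keeping the allow list this short makes it easy to review: every additional open port should correspond to a protocol that a customer application or data source actually uses.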
The following diagram outlines the recommended high-level network architecture including the recommendations from this section:
Virtual Machine and Storage Best Practices
We recommend that you commission an infrastructure architect to design the server and storage environments within which Pentaho will be hosted. It is critical the infrastructure be designed for scale and security to comply with your unique company standards, capacity, and performance requirements.
This section applies to designing the Azure resources that will host Pentaho. It does not dictate the infrastructure design, but instead highlights specific server and storage configurations and design fragments that should be incorporated into your overall infrastructure design to allow for best hosting of Pentaho.
NOTE: For detailed instructions on how to install Pentaho products on these servers and the specific OS configurations required, please see the Pentaho Installation Instructions and Components Reference.
Recommendation 1: Operating Systems
Pentaho Server and Repository are compatible with a range of operating systems, listed in the Components Reference. The validity and applicability of this list of operating systems is not impacted by Pentaho being deployed in Azure.
Select a supported operating system that fits best with your overall infrastructure architecture.
Recommendation 2: Pentaho Servers Specification
This section is applicable to the Azure virtual machines that host the Pentaho Server and the Pentaho Repository database.
An infrastructure architect should work with Pentaho personnel to appropriately size the Pentaho-hosting virtual machines to ensure adequate performance, capacity, and availability. Azure-specific concepts and restrictions to be incorporated during this exercise are outlined below.
Azure virtual machines must not only be sized for central processing unit (CPU) capacity, but also for network and disk IO. Each Azure VM type has a cap on the total disk and network IO it can process. Therefore, the virtual machine size selection must take this into account:
- The Pentaho server must have an appropriate network total IO cap to allow communication among the many data sources, applications, and clients.
- The Pentaho repository server must have an appropriate disk total IO cap to allow the repository database to be read/written to.
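The sizing rule above, which treats CPU, network IO, and disk IO as separate constraints, can be sketched as a filter over a catalog of candidate sizes. Everything in the catalog below is made up for illustration: the size names, caps, and costs are not real Azure VM figures, so replace them with the published limits for the VM series you are considering.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VmSize:
    name: str
    vcpus: int
    network_mbps: int    # total network IO cap
    disk_iops: int       # total disk IO cap
    monthly_cost: float


# Hypothetical catalog; not actual Azure VM sizes or prices.
CATALOG = [
    VmSize("size-small", 2, 1000, 3200, 70.0),
    VmSize("size-medium", 4, 2000, 6400, 140.0),
    VmSize("size-large", 8, 4000, 12800, 280.0),
]


def smallest_fit(catalog: List[VmSize], vcpus: int, network_mbps: int,
                 disk_iops: int) -> Optional[VmSize]:
    """Return the cheapest size meeting all three requirements, or None."""
    fits = [
        size for size in catalog
        if size.vcpus >= vcpus
        and size.network_mbps >= network_mbps
        and size.disk_iops >= disk_iops
    ]
    return min(fits, key=lambda size: size.monthly_cost, default=None)


# A Pentaho Server that is network-bound rather than CPU-bound.
server = smallest_fit(CATALOG, vcpus=2, network_mbps=1500, disk_iops=1000)
# A repository database server that is disk-bound.
repository = smallest_fit(CATALOG, vcpus=2, network_mbps=500, disk_iops=10000)
print(server.name, repository.name)
```

In both example calls the CPU requirement alone would be met by the smallest size. The IO cap is what forces the larger selection, which is the point the list above makes.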
Pentaho products can be deployed to provide High Availability (HA) and Disaster Recovery (DR). As the designs and mechanisms to achieve this are not specific to Azure, you can apply your corporate virtual machine HA and DR standards to Pentaho servers.
For example, Pentaho Server can be deployed in an active/active HA configuration. Further details on this are available in Best Practices - Pentaho Servers with High Availability (download the file corresponding to the version of Pentaho you are running).
Recommendation 3: Pentaho Database Product
Supported database products and versions for the Pentaho Repository database are listed in the Components Reference. A supported database should be installed on an Azure VM.
Database-as-a-service offerings, including Azure SQL Database, should not be used. See the known issue below regarding database as a service and Pentaho.
These maintenance recommendations will help you keep your Pentaho installation running at peak performance.
Recommendation 1: Backup
Since the Pentaho products run on standard Azure virtual machines, generally those VMs should be backed up using the same mechanism as your other Azure VMs. Similarly, the Pentaho database should be backed up as if it were just another database running on a VM. There are no backup requirements that are specific to Pentaho on Azure.
More information on backing up Pentaho can be found in Best Practices - Backup and Recovery.
Recommendation 2: Monitoring
As with backups, there are no additional monitoring requirements specific to Pentaho on Azure. The Pentaho VMs and repository database should be monitored in the same way as your other VMs and databases.
Learn more about monitoring Pentaho's processes in any environment, including Azure, in Best Practices - Logging and Monitoring for Pentaho Servers.
The following issues apply to Pentaho products running in Azure. Make sure to account for them in your infrastructure and network architecture.
Repository Database-Managed Service
The Pentaho Server's repository database is not currently supported on Azure's managed database solution.
Workaround: The repository database should be installed on an Azure virtual machine conforming to the Pentaho-supported configurations listed in the Components Reference.
Azure Load Balancer and Scale Sets
Pentaho Servers can run in active/active load-balanced mode, for example, as multiple Tomcat servers each running a copy of Pentaho Server. However, Azure Load Balancer and Scale Sets have not been tested as balancers for this design pattern.
Workaround: If you wish to run the Pentaho Server in active/active mode, use Apache Web Server to load-balance incoming traffic as in Best Practices - Pentaho Servers with High Availability (download the file corresponding to the version of Pentaho you are running).
These additional documents contain information related to this subject:
- Best Practices - Backup and Recovery
- Best Practices - Pentaho and Amazon Web Services
- Best Practices - Pentaho Servers with High Availability
- Pentaho Best Practices Library
- Pentaho Components Reference
- Supported configurations
- Software and configurations not mentioned in this document