This document covers best practices for architectural and installation considerations when running Pentaho applications on Google Cloud Platform (GCP). We advise customers running on GCP to take other factors into consideration when using this guide and to rely on an internal enterprise architect to design solutions using Pentaho software.
This document is not intended to dictate what the best options are, but rather to present some best practices for customers who are seeking to optimize and set up Pentaho in the cloud while taking advantage of the features of GCP. Some of the topics covered here include integration options, setting up Google projects, configuring networks and security, and logging and monitoring.
The intention of this document is to speak about topics generally; however, these are the specific versions covered here:
This document assumes that the reader is familiar with the Pentaho suite or other business intelligence and data integration software packages. Details on the Pentaho suite are available at Pentaho Products.
This document provides guidance for new customers considering using Pentaho on Google Cloud Platform. The focus of this document is to discuss overall architectural considerations in the discovery and design phases of analytics projects where the Pentaho suite will be used. Specifics such as data ingestion, managing data pipelines, data processing patterns, and so on will be discussed in another document.
Naming Standards for Cloud Resources
While naming standards are not the primary focus of this document, having a naming standard for cloud resources is a good practice that will minimize rework as more work is shifted to the cloud.
Naming standards are even more important in the cloud, because:
- Anyone with the right privileges can create resources, and
- Most resources (projects, networks, compute engines, databases, IP addresses, and so on) in GCP or any other cloud provider can be freely named so long as they are globally unique.
Create a naming standard for your cloud resources using one or more of these suggestions. Names should include:
- Some global identifier that is unique across the enterprise (or division within the enterprise).
- An abbreviation of the point in the lifecycle the resource is in (dev, test/QA, prod…).
- An abbreviation or acronym of the service that runs on the resource.
- The name of the type of resource, where appropriate. For example, identify resources such as external IP addresses, disks, networks, snapshots, images, or other types. You can add indicators (-ip, -disk, -snapshot, -img) to resource names that are created manually or programmatically. Some exceptions to this include service databases and BigQuery datasets, because GCP further qualifies those names prior to creating them.
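As a small illustration, the convention above can be captured in a helper that assembles names from these parts. The organization identifier `acme` and the service abbreviations below are hypothetical examples, not prescribed values:

```shell
# Compose a resource name as <org>-<env>-<service>[-<type>].
# "acme" and the service names below are hypothetical examples.
make_name() {
  echo "${1}-${2}-${3}${4:+-${4}}"
}

make_name acme dev pentaho-01 disk   # → acme-dev-pentaho-01-disk
make_name acme prod carte-02 ip      # → acme-prod-carte-02-ip
```

A convention like this keeps names sortable by environment and service, and the type suffix makes it obvious in billing and inventory reports what each resource is.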
Integration and Setup
There are different paths you can take to integrate your organization and Pentaho products. To choose the best options, you will need to know your organization’s needs and current setup, and then select the right solutions for your situation. Each of these sections has considerations to make and questions to answer:
Pentaho’s software consists of many applications and tools, and several options exist for installing and using the software, depending on your organization’s goals. As it is typically best to process and analyze data closest to where it is stored, the location of the data will be a key factor to consider when selecting where and how to install Pentaho services and tools. Your three integration options are installing Pentaho local to your organization, installing Pentaho in the cloud, or installing Pentaho in a hybrid fashion:
Local Pentaho Installation
You can install Pentaho locally if you have typical data stores on premises, such as traditional databases, FTP hooks, and third-party data integration gateways. Pentaho can then be configured to read from or write to GCP (cloud storage, BigQuery, DataProc, Pub/Sub, and so on).
In this case, the bulk of the data being processed would be local to your organization, and new and emerging data storage and processing patterns would be developed on GCP. When Pentaho services are installed on premises, you can still use Pentaho to analyze vast amounts of data in Google BigQuery, thanks to BigQuery's massively parallel processing architecture and the BigQuery Simba JDBC driver.
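As a sketch, a generic JDBC connection from Pentaho to BigQuery via the Simba driver uses a URL along these lines. The project ID, service-account email, and key path below are hypothetical placeholders, and the exact property names should be confirmed against the documentation for your driver version:

```shell
# Hypothetical Simba BigQuery JDBC URL, as it might be entered in a PDI
# generic database connection. OAuthType=0 selects service-account auth;
# all identifiers below are placeholders.
BQ_JDBC_URL='jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=my-project;OAuthType=0;OAuthServiceAcctEmail=pentaho@my-project.iam.gserviceaccount.com;OAuthPvtKeyPath=/opt/keys/pentaho-sa.json'
echo "$BQ_JDBC_URL"
```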
If you are investigating the option of storing, processing, and analyzing data in new data lakes or data warehouses in the cloud, we recommend you install Pentaho services in the cloud. Co-locating Pentaho services with the data that needs to be processed and presented in reports, dashboards, and analyses will speed up your application response time. Choosing this approach locates Pentaho Server and the Pentaho Repository in GCP.
If you have mixed workloads distributed between local data storage and GCP, you can install Pentaho services in a hybrid fashion. In this method, you choose GCP or local installation for Pentaho and additional services for job execution, depending on your needs.
Setting Up Google Projects
Google Cloud projects are the core organizational component within GCP. All cloud platform resources, such as compute engines, databases, and application engines, belong to a project and can be easily managed using the GCP Console or the Resource Manager application program interface (API).
Depending on the scale of the solutions being deployed in GCP, you may want to run all resources within one project or spread them across multiple projects. Further information on setting up Google Cloud projects is available at the Cloud Platform Resource Hierarchy. Since GCP is a shared resource within organizations, your planning stage must consider how costs will be allocated to the various business units that will interact with GCP. For smaller deployments, or small-to-mid-size enterprises with smaller GCP footprints, it is possible to run all services from one GCP project. However, as more business solutions are deployed and the environment becomes more complex, we advise you to split resources and services into:
- Tools projects, which would host shared applications and utilities (in this case, the Pentaho stack) that ingest, process, and analyze data for one or more solution projects.
- Solution projects, which would host solutions for business functions or teams.
For example, a solution project could be one that ingests data from third parties and loads and analyzes the data for a business unit using BigQuery or DataProc (GCP’s Hadoop service).
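The tools/solutions split might be provisioned as follows. The commands are printed rather than executed (a dry run), and the project IDs and display names are hypothetical:

```shell
run() { echo "+ $*"; }   # dry run: print each command instead of executing it

# Hypothetical project IDs: one shared tools project, one solution project.
run gcloud projects create acme-prd-tools-01 --name "Shared Pentaho Tools"
run gcloud projects create acme-prd-sales-dw-01 --name "Sales Data Warehouse"
```

Keeping the tools project separate means the Pentaho stack can be upgraded or billed independently of the solution projects it serves.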
Setting up a Tools Shared Services project with all Pentaho services installed would allow your enterprise to easily bill the various departments and business units that consume Pentaho services. Such a project would contain the following Pentaho components:
- Pentaho Server (application running in a Tomcat container)
- Pentaho Repository (see Components Reference for a list of supported databases)
- Pentaho execution engines (Carte Clusters/Servers)
- Git, for revision control
- Nexus and Jenkins, for continuous integration
Figure 1: Pentaho Services in GCP Products
Choosing an Engine to Run Pentaho Applications
GCP offers a range of options for deploying applications, including:
- Compute engines (virtual machines)
- Container engines
- Application engines
The officially supported application servers for Pentaho Server are JBoss and Tomcat. Typically, Pentaho would be installed on GCP compute engines.
|Although Pentaho Server can be manually installed as a J2EE application on GCP App Engine, it is NOT a supported application server.|
Carte is a simple web server that remotely executes transformations and jobs by accepting XML using a small servlet. This servlet contains the transformation to execute, along with the execution configuration. Carte Server also allows remote monitoring, starting, and stopping of transformations and jobs that it runs.
Carte Servers can be deployed in GCP containers. Because Pentaho Data Integration architecture supports dynamic Carte clustering, you can spin up multiple Carte Servers running in containers to process dynamic workloads. For more information, refer to the Dynamic Clustering features of Carte Cluster.
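A Carte server is typically started by pointing carte.sh at an XML configuration file describing the slave server. A minimal sketch follows, with a hypothetical name, host, and port; the file is written to /tmp here only so the fragment is self-contained:

```shell
# Minimal Carte slave-server configuration (hypothetical name/host/port).
# In practice this file lives with the PDI installation and is passed to
# carte.sh, e.g.:  ./carte.sh carte-master.xml
cat > /tmp/carte-master.xml <<'EOF'
<slave_config>
  <slaveserver>
    <name>carte-master-01</name>
    <hostname>10.128.0.10</hostname>
    <port>8081</port>
    <master>Y</master>
  </slaveserver>
</slave_config>
EOF
```

For dynamic clustering, additional Carte containers would register themselves against this master, so new slaves can join the cluster as workload demands.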
Configuring your installation is a process that requires optimization of many different areas to accommodate your organization’s needs. Defaults may suit you for some configuration purposes, but in other cases, you may want to change settings and options. Make sure that you address the following:
- Configuring Networks
- Secured Connections
- Cloud DNS
- Firewall Rules
- Cloud VPN
- External IP Addresses
- Load Balancing
- Integrated Security
- Running Pentaho Client Tools in GCP
- Logging and Monitoring
Every project in GCP is assigned a default network. By default, Google automatically assigns IP addresses from designated IP address blocks based on the region in which the project was created.
|To minimize the network traffic necessary to access and process data for analysis within the Pentaho stack, we recommend you create primary resources like databases and compute engines all within the same region (US East 1, US Central, etc.).|
Figure 2: Default Networks in GCP
Especially where Pentaho services are installed in GCP, communication between servers and clients within GCP and into and out of GCP should be secured using SSL certificates signed by a certificate authority (CA), whenever possible. An added advantage of SSL, aside from the security, is that SSL compression has been noted to increase response efficiency.
Most organizations have established local and wide area networks that connect all their local resources. To integrate existing networks with GCP and with Pentaho services, at a high level, you must extend your corporate network to GCP by configuring a subdomain pointing to a GCP cloud domain name server (DNS). To minimize the work involved in securing all resources in GCP, after implementation, we recommend you configure GCP projects to use IP addresses from within preassigned corporate ranges, rather than using GCP default IP address ranges. Your DNS will serve as the gateway for network traffic to and from GCP for your organization’s needs.
Firewall rules determine which traffic, from inside and outside of GCP, is allowed to reach targets inside GCP. Use the following example to create firewall rules that allow traffic to and from all compute engines and databases that support Pentaho, all within the same network:
Figure 3: Firewall Rules
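A rule like the one in Figure 3 might be created as follows. The commands are printed rather than executed (a dry run); the network name and source range are hypothetical, and the ports assume Tomcat (8080/8443) and Carte (8081) serving the Pentaho stack:

```shell
run() { echo "+ $*"; }   # dry run: print each command instead of executing it

# Hypothetical network name and source range; adjust ports to your stack.
run gcloud compute firewall-rules create allow-pentaho-internal \
    --network "pentaho-net" \
    --allow tcp:8080,tcp:8081,tcp:8443 \
    --source-ranges "10.128.0.0/16"
```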
Cloud Virtual Private Network (VPN), the conduit for interconnecting GCP projects, is configured to allow traffic from Pentaho Server running in one tools project to a solution project containing data that needs processing or analysis. Creating a VPN simplifies the network topology in GCP.
External IP Addresses
Compute engines are assigned default IP addresses that are accessible only from within GCP networks (that is, the project’s network).
|To directly access a compute engine, such as one running Tomcat web server, the compute engine would have to be assigned an external IP address.|
In contrast to the default internal IP addresses, external IP addresses incur an additional cost. Therefore, assign external IP addresses only to those compute engines that need to be accessed directly (not through a load balancer) or that need to be configured to access databases. For example, MySQL access within GCP requires that the compute engines have external IP addresses.
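Reserving and attaching a static external address might look like the following. The commands are printed rather than executed (a dry run); the resource names are hypothetical and 203.0.113.10 is a documentation-range placeholder, not a real reserved address:

```shell
run() { echo "+ $*"; }   # dry run: print each command instead of executing it

# Hypothetical names; in practice --address is the IP you reserved.
run gcloud compute addresses create "example-prd-pentaho-01-ip" --region "us-east1"
run gcloud compute instances add-access-config "example-prd-pentaho-01" \
    --zone "us-east1-b" --address "203.0.113.10"
```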
Installations of Pentaho on premises typically follow load balancing models where a web server such as Apache (HTTPD) forwards traffic in a round-robin fashion to one or more web application servers (Tomcat) running Pentaho applications.
Figure 4: Load Balancing Using Apache and Tomcat
Although similar models can be implemented on GCP, they would have a single point of failure in the Apache HTTPD, and would not make use of the native load balancing options offered by GCP:
Figure 5: Load Balancing Using GCP
The recommended load balancing option for Pentaho Services is HTTP(S) load balancing, where compute engines are assigned to compute engine instance groups:
Figure 6: HTTP(S) Load Balancing
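Assigning Pentaho compute engines to an instance group for the HTTP(S) load balancer might be sketched as follows; the commands are printed rather than executed (a dry run), and the group and instance names are hypothetical:

```shell
run() { echo "+ $*"; }   # dry run: print each command instead of executing it

# Hypothetical group and instance names for the Pentaho application tier.
run gcloud compute instance-groups unmanaged create "pentaho-ig" --zone "us-east1-b"
run gcloud compute instance-groups unmanaged add-instances "pentaho-ig" \
    --zone "us-east1-b" \
    --instances "example-prd-pentaho-01,example-prd-pentaho-02"
```

The instance group then becomes the backend of the load balancer, which is where the health checks described below are attached.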
Load balancers include health checks of the compute engines mapped to instance groups. Health checks are tied to monitoring and alerts through Stackdriver Logging, allowing enterprises to proactively monitor the health of the entire Pentaho stack.
|Apache HTTPD is unnecessary and therefore optional because GCP load balancers can direct traffic to Tomcat application services and proactively monitor Tomcat.|
As shown in Figure 7, all traffic to Pentaho compute engines uses the default secure sockets layer (SSL) port 443. The traffic is routed to the load balancer from CloudDNS. If Apache web server is used, it can route traffic to the application server (Tomcat) hosting Pentaho services, running on a designated port (8443 in Figure 7):
Figure 7: Load Balancing Using Port 443
Pentaho can be configured to authenticate users with Microsoft Active Directory and/or LDAP entitlement databases, depending on where Pentaho Services are installed: on premises, in the cloud, or hybrid:
- Installations on premises can directly connect to and use Active Directory and LDAP.
- Cloud-only installations of Pentaho Server would need to use web-based SAML authentication.
GCP provides a framework and tools that support integration with Active Directory and LDAP entitlement databases, using Google Cloud Directory Sync (GCDS). GCP also supports SAML 2.0-based single sign-on (SSO), which provides seamless SSO against all GCP interfaces.
Details on configuring Pentaho Server to support SAML authentication for cloud-based deployments can be found at Set Up SAML for the Pentaho Server.
Running Pentaho Client Tools in GCP
While we recommend you install and use Pentaho client tools on developer workstations on premises, working locally may introduce slight latency due to network traffic to and from the Pentaho Repository if the repository is not local. In that case, commission an enterprise architect to review the network topology between GCP and the corporate network.
Logging and Monitoring
Pentaho applications use the log4j logging framework, and all Pentaho applications can be configured to write logs to appenders using all the standard log4j options. In addition to detailed logs, Pentaho provides the Operations Mart, a solution that populates a data mart used to audit and monitor user activity through a set of reports and analytics.
Within GCP, you can use Pentaho logs for operational support of Pentaho analytic solutions by using Google Stackdriver to ingest, process, and interpret Pentaho log data in near-real-time. Stackdriver logging is already configured by default to monitor dozens of metrics within GCP, and Pentaho logs can be ingested into Stackdriver in different ways, including:
- Install a Stackdriver agent, configured to read log directories, ingest logs, and parse content, on all compute engines running Pentaho Servers.
- Use GCP Pub/Sub to route logging data to Stackdriver.
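The agent route might look like the following on a compute engine. The install-script URL is the one Google documents for the logging agent; the Pentaho log path and the fluentd source fragment are assumptions that would need adjusting to your installation, and the fragment is written to /tmp here only to keep the sketch self-contained (the agent actually reads configs from /etc/google-fluentd/config.d/):

```shell
run() { echo "+ $*"; }   # dry run: print each command instead of executing it

# Install the Stackdriver (google-fluentd) logging agent.
run curl -sSO https://dl.google.com/cloudagents/install-logging-agent.sh
run sudo bash install-logging-agent.sh

# Hypothetical tail config for pentaho.log (path assumes a default install).
cat > /tmp/pentaho-log.conf <<'EOF'
<source>
  @type tail
  path /opt/pentaho/server/pentaho-server/tomcat/logs/pentaho.log
  pos_file /var/lib/google-fluentd/pos/pentaho.pos
  tag pentaho-server
  format none
</source>
EOF
```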
Pentaho Installation Guidance
Pentaho provides detailed product installation guides online. In addition to these, you may find the following recommendations about cloud deployments on GCP helpful:
Choosing a Database for the Pentaho Repository
Pentaho supports a number of databases for use in hosting the Pentaho repository. Please check the Components Reference for a supported database for use in your installation.
When installing Pentaho on GCP, you have choices about what database engine and installation type to use:
- Installing a database on a compute engine and managing it, or
- Using a GCP fully-managed database as a service, which includes features such as automatic backups and failover.
|We recommend you use a database as a service, rather than an installed instance of a database on a compute engine.|
GCP currently supports MySQL and PostgreSQL. If you are using MySQL (CloudSQL), choose a MySQL Second Generation instance type.
Figure 9: Choosing a MySQL Instance Type
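Creating such an instance from the command line might look like the following. The commands are printed rather than executed (a dry run), and the instance name, tier, and region are hypothetical:

```shell
run() { echo "+ $*"; }   # dry run: print each command instead of executing it

# Hypothetical instance name, tier, and region for the Pentaho Repository.
run gcloud sql instances create "pentaho-repo" \
    --database-version "MYSQL_5_7" \
    --tier "db-n1-standard-2" \
    --region "us-east1"
```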
By default, GCP disables certain database privileges for security reasons. For example, the MySQL SUPER privilege is revoked by default, so no user can create functions. To lift this restriction, set the log_bin_trust_function_creators database flag using gcloud commands:
gcloud auth login
gcloud config set project tccc-test-shared-services
gcloud sql instances patch pentaho-repo --database-flags log_bin_trust_function_creators=on
Installing Across Lifecycles (Dev, Test/QA, Production)
GCP provides utilities to easily create snapshots of compute engines. This feature simplifies Pentaho installation across lifecycles by allowing a system administrator to create a pristine copy of a compute engine, install the necessary software, and clone it for use in other GCP projects.
You can create snapshots of compute engines, and create disks from snapshots in the same GCP project (in the case of creating additional Pentaho nodes for a High Availability (HA) cluster). If you need to create disks in a higher lifecycle (QA or production), you can create images from non-attached disks and move them to the appropriate GCP projects using the gcloud command-line tool.
Figure 10: Creating Images for GCP Projects
- Log in and set the default project.
gcloud auth login
gcloud config set project
- Create a snapshot of the source disk.
gcloud compute --project "" disks snapshot "example-dev-pentaho-01-disk-01" --zone "us-east1-b" --snapshot-names "example-pentaho-01-snapshot-01"
- Create a disk from the snapshot.
gcloud compute --project "" disks create "example-pentaho-02-disk-01" --size "110" --zone "us-east1-b" --source-snapshot "example-pentaho-01-snapshot-01" --type "pd-standard"
- Create an image from the non-attached disk.
gcloud compute --project "" images create "example-pentaho-01-img" --source-disk-zone "us-east1-b" --source-disk "example-pentaho-02-disk-01"
Making Pentaho Services Resumable
We recommend configuring services to resume automatically on compute engine restart/boot, particularly for cloud installations. Pentaho provides scripts and a template for creating start-up scripts that should be placed in /etc/init.d on Linux servers. More information on resuming on boot is available at Start and Stop the Pentaho Server for Configuration.
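The dispatch logic of such a start-up script can be sketched as follows. The install path is an assumption, and the echoes are there only so the sketch can be exercised without a Pentaho installation; the real template ships with Pentaho:

```shell
# Hypothetical install path; start-pentaho.sh / stop-pentaho.sh are the
# standard Pentaho Server control scripts.
PENTAHO_HOME="/opt/pentaho/server/pentaho-server"

pentaho_ctl() {
  case "$1" in
    start)
      echo "starting pentaho-server"
      if [ -x "$PENTAHO_HOME/start-pentaho.sh" ]; then "$PENTAHO_HOME/start-pentaho.sh"; fi
      ;;
    stop)
      echo "stopping pentaho-server"
      if [ -x "$PENTAHO_HOME/stop-pentaho.sh" ]; then "$PENTAHO_HOME/stop-pentaho.sh"; fi
      ;;
    *)
      echo "usage: pentaho_ctl {start|stop}"
      return 1
      ;;
  esac
}
```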
Here are some links to information that you may find helpful while using this best practices document: