ELT & ETL Orchestration
Last Updated: 2022-02-02It's easy to integrate Grouparoo into your existing ELT/ELT process! Grouparoo is a core part of the modern data stack and provides a number of ways to integrate with the rest of your data infrastructure. It is compatible with a number of different orchestrators - tools that schedule and coordinate different parts of your data pipeline.
There are 4 main ways to tell Grouparoo when to check for new or updated data:
- Triggering Schedules via Time
- Triggering Schedules via the API
- Triggering Schedules via a Query
- Running Grouparoo directly via the CLI
Triggering Schedules via Time
The simplest way to check for new data is to create a recurring Schedule within Grouparoo. A recurring Schedule will check your Source at a set time interval for new data. It then imports any new data it finds.
For example, this configuration below would check the users
table in the data warehouse for new data every 30 minutes. We know data is new by referring to the updated_at
column and storing the largest value seen as a High-Water Mark
// from config/sources/users.json
[
{
class: "Source",
id: "users",
modelId: "users",
name: "Data Warehouse Users",
type: "redshift-import-table",
appId: "data_warehouse",
options: {
table: "users",
},
mapping: {
id: "user_id",
},
},
{
class: "Schedule",
id: "users_schedule",
name: "Users Schedule",
sourceId: "users",
recurring: true,
recurringFrequency: 1800000, // 30 minutes in ms
confirmRecords: false,
options: {
column: "updated_at",
},
filters: [],
},
];
This method is the simplest of the four options and is built directly into Grouparoo.
Triggering Schedules via the API
Using the Grouparoo API, you can choose to run a Schedule whenever it is right for you. A common use-case is to trigger all of your Grouparoo Schedules to look for new & updated Records at the conclusion of a dbt transform step via an Orchestrator like Airflow or Dagster.
The /api/v1/schedules/run
endpoint can be used to trigger a Run for your Schedules. You can run all of your Schedules or provide a list of specific scheduleIds
to run.
First, create an API Key with the write
permission on Sources
. Then you can use the API as documented below:
# Set Variables
GROUPAROO_HOST="http://localhost:3000"
GROUPAROO_API_KEY="ba05b18d9dae4acbb08799e7ef238f0d"
# Request to run all Schedules now
curl \
-X POST \
-d "apiKey=$GROUPAROO_API_KEY" \
"$GROUPAROO_HOST/api/v1/schedules/run"
# --- OR ---
# Request to view the available schedules
curl \
-X GET \
"$GROUPAROO_HOST/api/v1/schedules?apiKey=$GROUPAROO_API_KEY"
# Request to run a specific schedule now
# The names of your schedules will be of the format (source_name)_schedule
# Provide IDs as a JSON Array
curl \
-X POST \
-d "apiKey=$GROUPAROO_API_KEY" \
-d "scheduleIds=[\"users_table_schedule\"]" \
"$GROUPAROO_HOST/api/v1/schedules/run"
More information about using Grouparoo's REST API can be found here, including how to generate API Keys. You can learn more about the specifics of this (or any) API endpoint that your Grouparoo instance provides by visiting the /swagger
page on your Grouparoo instance.
If you want to wait until the Runs created by the call to /api/v1/schedules/run
are complete before moving on to the next step in your data pipeline, you can! You can also use the API to retrieve run status:
# Store the Run ID
RUN_ID="run_aad727ae-b8f8-4a6a-b205-015113bf0e7c"
# Request details of a Run
# Pipe through `jq` to get just the property we are looking for
curl \
-X GET \
"$GROUPAROO_HOST/api/v1/run/$RUN_ID?apiKey=$GROUPAROO_API_KEY" \
| jq '.run.state'
Note that in order to use this method, your API Key will need read access to the run
permission as well.
Triggering Schedules via a Query
Grouparoo can be run by querying a table within your App using an App Refresh Query when configuring your App. For more information on implementing an App Refresh Query, check out our App Configuration docs.
At a high level, using a query to orchestrate your runs allows you to do things like sync data from multiple sources when a particular source, or even a meta or logs table, has new data.
Running Grouparoo directly via the CLI
Grouparoo can always be run via the command line with grouparoo run
. It is important to note that this method runs once and then terminates when the job is completed. The grouparoo run
command is a manual option that allows you to check for updates only once, each time you explicitly run it.
Some customers may want to run Grouparoo via the CLI directly from their orchestrator as the final part of their ETL/ELT pipeline. Alternatively, it could simply run every night via cron
. For these use cases, we recommend configuring Grouparoo via the config UI and using the grouparoo run
command. This will run all Schedules immediately when grouparoo run
is started.
$ grouparoo run
š¦ Grouparoo: run
notice: Starting Grouparoo v0.7.4 on node.js v16.8.0
# wait for completion...
ā-- š¦ Run Completed @ 2021-11-12T18:55:38.806Z ---
|
| RECORDS
|
| * Records Updated: 105
| * Records Created: 105
| * All Records: 105
|
| SOURCES
|
| 1. Product Users
| * Records Created: 100
| * Records Imported: 100
| * Imports Created: 101
| * Error: none
|
| 2. Accounts
| * Records Created: 5
| * Records Imported: 5
| * Imports Created: 6
| * Error: none
|
| DESTINATIONS
|
| 1. Logger Destination
| * Exports Created: 100
| * Exports Failed: 0
| * Exports Complete: 100
ā--------------------------------------------------
notice: All Tasks Complete!
There are flags which can be used to customize the run command's behavior. This method of triggering Schedules is only applicable when self-hosting Grouparoo.
Having Problems?
If you are having trouble, visit the list of common issues or open a Github issue to get support.