The power of remote engine execution for ETL/ELT data pipelines

Business leaders risk compromising their competitive edge if they do not proactively implement generative AI (gen AI). However, businesses scaling AI face entry barriers. Organizations require reliable data for robust AI models and accurate insights, yet the current technology landscape presents unprecedented data quality challenges.

According to International Data Corporation (IDC), stored data is set to increase by 250% by 2025, with data rapidly propagating on-premises and across clouds, applications and locations with compromised quality. This situation will exacerbate data silos, increase costs and complicate the governance of AI and data workloads.

The explosion of data volume in different formats and locations, together with the pressure to scale AI, looms as a daunting task for those responsible for deploying AI. Data must be combined and harmonized from multiple sources into a unified, coherent format before being used with AI models. Unified, governed data can also be put to use for various analytical, operational and decision-making purposes. This process is known as data integration, one of the key components of a strong data fabric. End users cannot trust their AI output without a proficient data integration strategy to integrate and govern the organization’s data.

The next level of data integration

Data integration is vital to modern data fabric architectures, especially since an organization’s data often resides in a hybrid, multicloud environment and in multiple formats. With data residing in various disparate locations, data integration tools have evolved to support multiple deployment models. With the increasing adoption of cloud and AI, fully managed deployments for integrating data from diverse, disparate sources have become popular. For example, fully managed deployments on IBM Cloud enable users to take a hands-off approach with a serverless service and benefit from application efficiencies like automatic maintenance, updates and installation.

Another deployment option is the self-managed approach, such as a software application deployed on-premises, which offers users full control over their business-critical data, thus lowering data privacy, security and sovereignty risks.

The remote execution engine is a fantastic technical development that takes data integration to the next level. It combines the strengths of fully managed and self-managed deployment models to provide end users the utmost flexibility.

There are several styles of data integration. Two of the more popular methods, extract, transform, load (ETL) and extract, load, transform (ELT), are both highly performant and scalable. Data engineers build data pipelines, which are called data integration tasks or jobs, as incremental steps to perform data operations and orchestrate these data pipelines in an overall workflow. ETL/ELT tools typically have two components: design time (to design data integration jobs) and runtime (to execute data integration jobs).
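
To make those incremental steps concrete, here is a minimal ETL sketch in Python. It is illustrative only, not DataStage internals: the extract, transform and load functions stand in for the pipeline steps described above, and sqlite3 from the standard library stands in for the source and target systems so the example runs on its own.

```python
# Minimal, illustrative ETL sketch: extract rows from a source system,
# transform them in the pipeline, then load them into a target system.
import sqlite3

def extract(source: sqlite3.Connection) -> list[tuple]:
    # Extract: pull raw rows from the source.
    return source.execute("SELECT id, amount FROM transactions").fetchall()

def transform(rows: list[tuple]) -> list[tuple]:
    # Transform: apply business logic inside the pipeline (ETL style).
    return [(row_id, round(amount * 1.07, 2)) for row_id, amount in rows]

def load(target: sqlite3.Connection, rows: list[tuple]) -> None:
    # Load: write the harmonized result to the target.
    target.executemany("INSERT INTO transactions_clean VALUES (?, ?)", rows)
    target.commit()

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE transactions (id INTEGER, amount REAL)")
source.executemany("INSERT INTO transactions VALUES (?, ?)", [(1, 100.0), (2, 250.0)])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE transactions_clean (id INTEGER, amount REAL)")

load(target, transform(extract(source)))
print(target.execute("SELECT * FROM transactions_clean").fetchall())
```

A design-time tool generates jobs like this visually; the runtime is the component that actually executes them against live connections.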

From a deployment perspective, they have been packaged together, until now. The remote engine execution is revolutionary in the sense that it decouples design time and runtime, creating a separation between the control plane and the data plane where data integration jobs are run. The remote engine manifests as a container that can be run on any container management platform or natively on any cloud container service. The remote execution engine can run data integration jobs for cloud to cloud, cloud to on-premises, and on-premises to cloud workloads. This enables you to keep the design time fully managed while you deploy the engine (runtime) in a customer-managed environment, on any cloud, such as in your VPC, in any data center and in any geography.
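
The sketch below illustrates that decoupling in miniature; the job specification format, names and flow are hypothetical, not the actual DataStage protocol. What it shows is the boundary: the control plane holds only job definitions, while the remote engine executes the job next to the data, so row-level data never crosses into the fully managed service.

```python
# Conceptual control-plane/data-plane sketch (hypothetical names and format).
import json

# -- Control plane (fully managed): stores job *definitions*, never raw data. --
control_plane_jobs = {
    "job-42": json.dumps({
        "extract": {"connection": "onprem-db", "table": "orders"},
        "transform": [{"op": "filter", "predicate": "amount > 0"}],
        "load": {"connection": "cloud-warehouse", "table": "orders_clean"},
    })
}

# -- Data plane (remote engine in the customer's VPC or data center). --
def remote_engine_run(job_id: str) -> None:
    spec = json.loads(control_plane_jobs[job_id])  # fetch the definition only
    # Execute extract/transform/load locally, next to the data sources;
    # only status and metrics would flow back to the control plane.
    print(f"running {job_id}: {spec['extract']['table']} -> {spec['load']['table']}")

remote_engine_run("job-42")
```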

This innovative flexibility keeps data integration jobs closest to the business data with the customer-managed runtime. It prevents the fully managed design time from touching that data, improving security and performance while retaining the application efficiency benefits of a fully managed model.

The remote engine allows ETL/ELT jobs to be designed once and run anywhere. To reiterate, the remote engine’s ability to provide ultimate deployment flexibility has compounding benefits:

  • Users reduce data movement by executing pipelines where data lives.
  • Users lower egress costs.
  • Users minimize network latency.
  • As a result, users boost pipeline performance while ensuring data security and controls.

While there are several business use cases where this technology is advantageous, let’s examine these three:

1. Hybrid cloud data integration

Traditional data integration solutions often face latency and scalability challenges when integrating data across hybrid cloud environments. With a remote engine, users can run data pipelines anywhere, pulling from on-premises and cloud-based data sources, while still maintaining high performance. This enables organizations to use the scalability and cost-effectiveness of cloud resources while keeping sensitive data on-premises for compliance or security reasons.



Use case scenario: Consider a financial institution that needs to aggregate customer transaction data from both on-premises databases and cloud-based SaaS applications. With a remote runtime, they can deploy ETL/ELT pipelines within their virtual private cloud (VPC) to process sensitive data from on-premises sources while still accessing and integrating data from cloud-based sources. This hybrid approach helps to ensure compliance with regulatory requirements while taking advantage of the scalability and agility of cloud resources.
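
As a rough illustration of this pattern (with hypothetical data and stubbed connectors rather than a real DataStage job), the sketch below joins on-premises transaction rows with cloud SaaS customer data inside the VPC, so only aggregates ever leave the controlled network:

```python
# Hedged hybrid-cloud sketch: sensitive rows are processed inside the VPC;
# only segment-level totals would cross the network boundary.

def extract_onprem_transactions() -> list[dict]:
    # In a real job this would query the on-prem database over the local network.
    return [{"cust_id": "c1", "amount": 120.0}, {"cust_id": "c2", "amount": 80.0}]

def extract_saas_customers() -> dict[str, str]:
    # In a real job this would call the SaaS application's HTTPS API.
    return {"c1": "retail", "c2": "corporate"}

def aggregate_by_segment() -> dict[str, float]:
    # Join and aggregate locally; only compact totals leave the boundary.
    segments = extract_saas_customers()
    totals: dict[str, float] = {}
    for txn in extract_onprem_transactions():
        seg = segments.get(txn["cust_id"], "unknown")
        totals[seg] = totals.get(seg, 0.0) + txn["amount"]
    return totals

print(aggregate_by_segment())  # {'retail': 120.0, 'corporate': 80.0}
```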

2. Multicloud data orchestration and cost savings

Organizations are increasingly adopting multicloud strategies to avoid vendor lock-in and to use best-in-class services from different cloud providers. However, orchestrating data pipelines across multiple clouds can be complex and expensive due to ingress and egress operating expenses (OpEx). Because the remote runtime engine supports any flavor of containers or Kubernetes, it simplifies multicloud data orchestration by allowing users to deploy on any cloud platform with ideal cost flexibility.

Transformation styles like TETL (transform, extract, transform, load) and SQL Pushdown also synergize well with a remote engine runtime to capitalize on source/target resources and limit data movement, thus further reducing costs. With a multicloud data strategy, organizations need to optimize for data gravity and data locality. In TETL, transformations are initially executed within the source database to process as much data locally as possible before following the traditional ETL process. Similarly, SQL Pushdown for ELT pushes transformations to the target database, allowing data to be extracted, loaded, and then transformed within or near the target database. These approaches minimize data movement, latency and egress fees by pairing these integration patterns with a remote runtime engine, enhancing pipeline performance while giving users the flexibility to design pipelines for their use case.
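
The following sketch illustrates SQL Pushdown for ELT, using sqlite3 as a stand-in for the target warehouse; table and column names are hypothetical. Raw rows are landed first, then the transformation executes as SQL inside the target's own engine rather than in the integration runtime:

```python
# Illustrative SQL Pushdown (ELT) sketch with sqlite3 as a stand-in warehouse.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")

# Extract + Load: land raw records in the target as-is.
warehouse.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("emea", 100.0), ("emea", 40.0), ("amer", 75.0)],
)

# Transform, pushed down: the aggregation runs in the warehouse's own SQL
# engine, so no rows are pulled back through the integration runtime.
warehouse.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM raw_sales
    GROUP BY region
    """
)
print(warehouse.execute("SELECT * FROM sales_by_region").fetchall())
```

Because only SQL travels to the database and only results come back, data movement (and the egress fees that come with it) stays minimal.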



Use case scenario: Suppose that a retail company uses a combination of Amazon Web Services (AWS) for hosting their e-commerce platform and Google Cloud Platform (GCP) for running AI/ML workloads. With a remote runtime, they can deploy ETL/ELT pipelines on both AWS and GCP, enabling seamless data integration and orchestration across multiple clouds. This ensures flexibility and interoperability while using the unique capabilities of each cloud provider.

3. Edge computing data processing

Edge computing is becoming increasingly prevalent, especially in industries such as manufacturing, healthcare and IoT. However, traditional ETL deployments are often centralized, making it challenging to process data at the edge where it is generated. The remote execution concept unlocks the potential for edge data processing by allowing users to deploy lightweight, containerized ETL/ELT engines directly on edge devices or within edge computing environments.



Use case scenario: A manufacturing company needs to perform near real-time analysis of sensor data collected from machines on the factory floor. With a remote engine, they can deploy runtimes on edge computing devices within the factory premises. This enables them to preprocess and analyze data locally, reducing latency and bandwidth requirements, while still maintaining centralized control and management of data pipelines from the cloud.
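
To illustrate the edge pattern (with a hypothetical sensor feed, not a real factory integration), the sketch below reduces a window of raw samples to a compact summary on the device, so only the summary, not every reading, travels upstream:

```python
# Hedged edge-preprocessing sketch: aggregate raw sensor samples locally,
# flag anomalies in near real time, and forward only compact summaries.
from statistics import mean

def summarize_window(readings: list[float]) -> dict[str, float]:
    # Reduce a window of raw samples to a small summary record.
    return {"min": min(readings), "max": max(readings), "avg": mean(readings)}

# One window of raw vibration samples collected on the device (stand-in data).
window = [0.41, 0.39, 0.44, 0.95, 0.42]

summary = summarize_window(window)
if summary["max"] > 0.9:
    # Anomalies can be flagged at the edge without a round trip to the cloud...
    print("alert: vibration spike", summary)
# ...while only the summary (not every sample) is forwarded upstream.
```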

Unlock the power of the remote engine with DataStage-aaS Anywhere

The remote engine helps take an enterprise’s data integration strategy to the next level by providing ultimate deployment flexibility, enabling users to run data pipelines wherever their data resides. Organizations can harness the full potential of their data while reducing risk and lowering costs. Embracing this deployment model empowers developers to design data pipelines once and run them anywhere, building resilient and agile data architectures that drive business growth.
Users can benefit from a single design canvas, then toggle between different integration patterns (ETL, ELT with SQL Pushdown, or TETL) without any manual pipeline reconfiguration, to best suit their use case.


IBM® DataStage®-aaS Anywhere benefits customers by using a remote engine, which enables data engineers of any skill level to run their data pipelines within any cloud or on-premises environment. In an era of increasingly siloed data and the rapid growth of AI technologies, it’s important to prioritize secure and accessible data foundations. Get a head start on building a trusted data architecture with DataStage-aaS Anywhere, the NextGen solution built by the trusted IBM DataStage team.

Learn more about DataStage-aaS Anywhere

Try IBM DataStage as a Service for free
