Master's in Data Analytics

Master's in Data Analytics. Big Data. Class No. 05 | ETL tools available on the market. Technical issues - Features - Use cases. We begin at: Bogotá, September 17, 2021

Transcript of Master's in Data Analytics

Page 1: Master's in Data Analytics

Master's in Data Analytics

Big Data

Class No. 05 | ETL tools available on the market

Technical issues - Features - Use cases

We begin at:

Bogotá, September 17, 2021

Page 2: Master's in Data Analytics

ETL tools in data analysis workflows: review

Page 3: Master's in Data Analytics

ETL tools: selection criteria

Functional criteria:

• Handle different source and target data systems and formats

• Perform data transformation processes (sorting, filtering, and aggregation)

• Data cleansing, data quality, and data governance functionality

Interoperability:

• Continuous integration and continuous delivery functionality

• Transparency between on-premises and cloud-based scenarios

• Modularity (switching providers for each part / step / component of the ETL process)

• Multicloud capabilities

Integration and scalability:

• Adaptation to new technologies and/or implementations

• Horizontal / vertical scaling capabilities

• Portability across technologies, platforms, and infrastructure

* https://www.talend.com/es/resources/etl-tools/
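As a minimal sketch of the transformation criterion above (sorting, filtering, and aggregation), the following Python example applies the three operations to a small in-memory dataset. The records, the field names, and the `transform` helper are hypothetical illustrations and are not taken from any of the tools discussed in this class.

```python
from collections import defaultdict

# Hypothetical source records; in practice they would be extracted
# from a file, a database, or an API.
orders = [
    {"city": "Bogotá", "amount": 120.0},
    {"city": "Medellín", "amount": 80.0},
    {"city": "Bogotá", "amount": 30.0},
    {"city": "Cali", "amount": 5.0},
]


def transform(records, min_amount):
    """Filter, aggregate, and sort records (the 'T' in ETL)."""
    # Filter: keep only orders at or above the threshold.
    kept = [r for r in records if r["amount"] >= min_amount]
    # Aggregate: total amount per city.
    totals = defaultdict(float)
    for r in kept:
        totals[r["city"]] += r["amount"]
    # Sort: largest total first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


result = transform(orders, min_amount=10.0)
print(result)
```

A graphical ETL tool typically exposes each of these steps as a configurable component, which is where the modularity criterion becomes relevant.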

Page 4: Master's in Data Analytics

ETL software integration

https://www.talend.com/es/resources/etl-tools/

https://streamsets.com/

* Hortonworks has since merged with Cloudera

Page 5: Master's in Data Analytics

ETL software: GUI examples

https://www.talend.com/blog/2018/06/25/why-data-scientists-love-python-and-how-to-use-it-with-talend/

https://streamsets.com/blog/automating-pipeline-development-with-the-streamsets-sdk-for-python/

* Several frameworks provide Python libraries (APIs) and/or implement their functionality in Python

Page 6: Master's in Data Analytics

Representative examples from the ETL tools market

Page 7: Master's in Data Analytics

Common sources and targets of ETL processes in real-world use cases *

* Retrieved from https://striim.com

Page 8: Master's in Data Analytics

Some relevant ETL tools and frameworks

• Talend Data Integration

• Oracle Data Integrator

• Xplenty

• Informatica PowerCenter

• Stitch

• FlyData

• Fivetran

• AWS Glue

• Pentaho

• Striim

• Panoply

• Hevo Data

• Matillion

Page 9: Master's in Data Analytics

Talend Data Integration

• Open-source ETL data integration solution.

• Talend's proprietary, paid Data Management Platform adds tools and features for design, productivity, management, monitoring, and data governance.

Pros:

• Compatible with data sources both on-premises and in the cloud

• Hundreds of pre-built integrations.

Cons:

• Several interesting features are restricted to the paid version.

Use cases:

• Companies preferring open-source solutions

• Companies requiring multiple (or complex) pre-built data integrations.

* In Big Data frameworks, many operations are performed using “lazy” strategies, i.e., operations are first defined (planned) on metadata and then executed on the actual data
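The “lazy” strategy in the note above can be illustrated with a small, single-process sketch: operations are first recorded as a plan, and the data is only touched when a result is requested. The `LazyPipeline` class and its method names are hypothetical and do not correspond to the API of Talend or of any specific Big Data framework.

```python
class LazyPipeline:
    """Records operations as a plan; nothing runs until collect() is called."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = tuple(plan)

    def map(self, fn):
        # Only extend the plan; the data is not touched here.
        return LazyPipeline(self._source, self._plan + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._source, self._plan + (("filter", pred),))

    def describe(self):
        # Inspect the plan (the "metadata") without executing it.
        return [kind for kind, _ in self._plan]

    def collect(self):
        # Execute the recorded plan on the actual data.
        data = iter(self._source)
        for kind, fn in self._plan:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def double(x):
    calls.append(x)  # lets us observe when the function really runs
    return x * 2


pipeline = LazyPipeline(range(5)).map(double).filter(lambda x: x > 2)
plan = pipeline.describe()
calls_before_collect = len(calls)  # still 0: only the plan exists so far
result = pipeline.collect()
print(plan, calls_before_collect, result)
```

Separating planning from execution is what allows a framework to inspect, reorder, or distribute the operations before any data is read.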

Page 10: Master's in Data Analytics

Oracle Data Integrator

• Part of Oracle’s data management ecosystem.

• Available in both on-premises and cloud versions (Oracle Data Integration Platform Cloud).

Pros:

• Simple integration with other Oracle applications (such as Hyperion Financial Management or Oracle E-Business Suite, EBS).

Cons:

• Supports ELT workloads (not ETL).

• Certain tools require different Oracle software and suites (could be expensive).

• The learning curve is steep.

Use cases:

• Companies that have already acquired other components of the Oracle ecosystem

• Companies whose data integration processes are oriented to ELT pipelines.

* Tez and MapReduce are approaches to data-oriented distributed computing
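To give an idea of the MapReduce approach mentioned in the note above, here is a minimal single-process sketch of a word count organized into map, shuffle, and reduce phases. It only illustrates the programming pattern: a real framework distributes these phases across many machines, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input; each string plays the role of one input split.
documents = ["etl tools load data", "elt tools load raw data"]


def map_phase(doc):
    """Emit a (key, value) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]


def shuffle(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```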

Page 11: Master's in Data Analytics

Xplenty

• Cloud-based, paid data integration platform for ETL and ELT (extract, load, transform)

• Oriented to visual interfaces for building data pipelines with multiple sources and destinations.

• Includes connectivity to MongoDB, MySQL, PostgreSQL, Amazon Redshift, Google Cloud Platform, Facebook, Salesforce, etc.

Pros:

• Supports scalability and security configuration for several scenarios

• Field-level encryption with a per-user encryption key.

• Regulatory compliance with laws such as HIPAA, GDPR, and CCPA.

Cons:

• Rigid annual billing. Could be expensive for small organizations.

Use cases:

• Companies using both ETL and ELT workloads

• Companies that prefer visual interfaces

• Companies requiring multiple pre-built integrations

• Companies requiring strong data security features.

Page 12: Master's in Data Analytics

Informatica PowerCenter

• Enterprise data integration platform for ETL workloads.

• PowerCenter is part of the Informatica cloud data management tools suite.

• Enterprise-class, database-neutral solution including cloud data management tools

Pros:

• High performance and high compatibility

Cons:

• Could be expensive; the payment plans are complex

• Steep learning curve

Use cases:

• Large enterprises with robust budgets that need to solve big data problems and/or run heavy analytic workloads, or that demand high-performance features.

Page 13: Master's in Data Analytics

AWS Glue

• Fully managed ETL service from Amazon Web Services

• Designed for big data analytics.

• Job scheduling

• “Developer endpoints” for testing scripts.

Pros:

• Direct integration with other services and tools in the AWS ecosystem.

• Serverless: Amazon automatically provisions a server for users and shuts it down when the workload is complete.

Cons:

• Could be less flexible than other tools, and typically best suited to users who are already within the AWS ecosystem.

Use cases:

• Companies already using AWS for their data analytics needs.

• Companies requiring fully managed ETL approaches.

Page 14: Master's in Data Analytics

Pentaho (Kettle)

• Open-source platform offered by Hitachi Vantara, used for both data integration and analytics.

• There is a free community edition along with a commercial license for the software’s enterprise edition.

Pros:

• User-friendly interface for building “robust” ETL pipelines.

Cons:

• Some users report poor documentation, especially for error detection and management.

Use cases:

• Companies oriented to open-source tools and frameworks, including those oriented to ETL processes.

Page 15: Master's in Data Analytics

Striim

• Real-time data integration for big data.

• Multiple sources and targets of different types.

• About 20 different sources and file formats (Oracle, SQL Server, MySQL, PostgreSQL, MongoDB, Hadoop, etc.).

Pros:

• Compliant with data privacy regulations such as GDPR and HIPAA.

• Pre-load transformations using SQL or Java.

Cons:

• Does not include SaaS sources or targets.

• Does not allow adding new data sources.

• Small community

Use cases:

• Companies requiring GDPR or HIPAA compliance.

• Companies that use a fixed set of data sources and do not require SaaS interoperability.

Page 16: Master's in Data Analytics

Fivetran

• Cloud-based, data warehouse-oriented ETL platform

• Data integration with Redshift, BigQuery, Azure, and Snowflake

• About 90 possible SaaS sources

• Custom integrations are supported

Pros:

• Easy to use (management and user interfaces)

• New connectors are configured in a fast and straightforward manner (in most cases)

Cons:

• The pricing model is complex and depends on several factors.

• Some users report poor support for complex technical issues

Use cases:

• Companies requiring several pre-built integrations

• Companies using multiple data warehouses of different types

Page 17: Master's in Data Analytics

Stitch

• Open-source ELT data integration platform with paid service extensions for advanced use cases and larger numbers of data sources.

Pros:

• Self-service ELT and automated data pipelines.

• Simple pricing

• High performance according to several tests

Cons:

• Does not perform arbitrary transformations, since transformations are applied on top of the raw data after the extraction process

• Stitch was acquired by Talend in fall 2018.

• Some technical issues are reported

• Some data sources are poorly supported

Use cases:

• Companies oriented to open-source solutions

• Companies using simple ELT processes

• Companies not requiring complex transformations

* Most ELT tools do not implement explicit transformations on raw data, since some data analytics processes perform them as the first step of the analysis
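The ELT ordering described in the note above can be sketched as follows: raw data is loaded into the target unchanged, and the transformation runs later, inside the target, as part of the analysis. In this sketch an in-memory SQLite database stands in for a data warehouse; the table name, column names, and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw extract: every value arrives as text, including a bad one.
raw_rows = [
    ("2021-09-17", "Bogotá", "120.5"),
    ("2021-09-17", "Cali", "n/a"),
    ("2021-09-18", "Bogotá", "30"),
]

# In-memory SQLite stands in for the target data warehouse (assumption).
conn = sqlite3.connect(":memory:")

# Load: store the raw data as-is, with no cleaning or type conversion.
conn.execute("CREATE TABLE raw_sales (day TEXT, city TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)
raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]

# Transform: performed afterwards, inside the target, as the first
# step of the analysis (filter bad values, cast, aggregate).
query = """
    SELECT city, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    WHERE amount GLOB '[0-9]*'
    GROUP BY city
    ORDER BY city
"""
totals = conn.execute(query).fetchall()
conn.close()

print(raw_count, totals)
```

Because the raw table keeps every extracted row, including the invalid one, different analyses can later apply different cleaning rules to the same loaded data.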