Briefing Insights: The Talend Business User Offensive – Talend Data Preparation, Talend Data Stewardship and Talend Data Catalog

Timm Grosser, BARC’s Senior Analyst for Data Management, offers some takeaways from his recent briefing with Talend, creators of the Data Fabric platform.

Introduction

Talend is an open source provider of middleware solutions for data management and application integration. Founded in France in 2006, Talend is now headquartered in the United States and has 26 offices worldwide. The company continues to deliver strong annual growth. Talend started out with data integration in 2006 and has since added data management functionality. Today, cloud solutions have become mission-critical for the vendor, and its strategy is now cloud-first. The company has been listed on the stock exchange since 2016 and has a broad network of over 150 partners.

The goal

Talend’s goal is to make data useful for all organizations and to ‘change the way the world makes decisions’. According to Talend, two conditions in particular must be met to achieve this: speed, in the sense of a short time to value from data, and the trust of data consumers. Without both, data usage cannot be efficient. To this end, Talend offers a data platform that helps collect, transform, govern and share data. With these capabilities, Talend aims to tame data chaos and help manage data efficiently through data governance.

The technology

At the core of its offering is a unified platform, Talend Data Fabric, which helps customers to transform data into tangible business outcomes. The platform includes different modules, which can also be licensed separately or in bundles. The goal is to offer a range of data management apps based on one platform for implementing various data integration patterns on-premises, in the cloud or in hybrid scenarios.


Figure 1: Talend portfolio overview

Talend has been working on integrating business users into data management processes since 2018. In the briefing, we specifically dealt with those modules that have been developed for business users and are in widespread use: Data Catalog, Data Preparation and Data Stewardship.

Data Preparation and Data Stewardship are closely intertwined with each other and with Talend’s integration tools, and are therefore considered together below. Talend Data Catalog, on the other hand, can be used as a standalone tool that requires no other Talend products to operate. However, the data catalog reveals its full potential when used as part of the Talend Data Fabric.

Data preparation and data stewardship

To understand the products, we need to look at the end-to-end process from data connectivity to deployment in a data warehouse or data lake scenario, or to pushing data services in real time. Imagine a technical developer connecting new sources in Talend Studio and trying to combine them, only to find that the data sets are not of the required quality. The technical developer lacks the knowledge of the content needed to clean up the data and build a good integration path. Domain expertise is required here. With Talend Data Preparation and Talend Data Stewardship, the provider has created interfaces to involve business users directly in the data preparation process. Both products primarily serve to improve data quality or enrich the data.

The technical developer can provide data to the business user in Talend Data Preparation via Talend Studio. The user now has the opportunity to adjust the data step by step in an Excel-like interface. Talend suggests functions for editing and draws attention to any anomalies in the data. My favorite function is ‘Magic fill’. The AI-based function detects patterns, derives rules from them and can apply them. For example, with an input and an output like this:

  • Input 1: Timm – Output 1: T.
  • Input 2: Carsten – Output 2: C.

Here, the system learns that it should abbreviate first names. There is a lot of potential in this capability. Unfortunately, we didn’t have time to test my more confusing examples 😀
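To make the idea of learning a rule from examples more concrete, here is a minimal sketch in Python. It is not Talend's implementation, which was not disclosed in the briefing. The candidate transformations and the names `CANDIDATES`, `learn_rule` and `apply_rule` are hypothetical, and a real tool would search a far larger space of possible rules.

```python
from typing import Callable, Optional

# Hypothetical candidate transformations (an assumption for illustration).
CANDIDATES: dict[str, Callable[[str], str]] = {
    "uppercase": str.upper,
    "lowercase": str.lower,
    "first letter": lambda s: s[:1],
    "abbreviate to initial": lambda s: s[:1].upper() + ".",
}


def learn_rule(examples: list[tuple[str, str]]) -> Optional[str]:
    """Return the name of the first candidate that reproduces every example."""
    for name, transform in CANDIDATES.items():
        if all(transform(source) == target for source, target in examples):
            return name
    return None


def apply_rule(name: str, values: list[str]) -> list[str]:
    """Apply a previously learned rule to new values."""
    return [CANDIDATES[name](value) for value in values]


# The two input/output pairs from the example above.
examples = [("Timm", "T."), ("Carsten", "C.")]
rule = learn_rule(examples)

# Fill the remaining rows with the learned rule, if one was found.
filled = apply_rule(rule, ["Anna", "Jacques"]) if rule else []
```

With the two examples given, only the "abbreviate to initial" candidate matches both pairs, so the remaining names are shortened in the same way.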

Rules can also be created and applied to check and correct records. Once all anomalies have been resolved by the business user and the rules have been defined, a log – the ‘Preparation’ – shows the progress of the actions taken. This can now be used as a recipe in Talend Studio and included in the technical developer’s pipeline.
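The notion of a preparation as a reusable recipe can be sketched as an ordered list of steps that is replayed on new records. This is an illustrative model under my own assumptions, not Talend's data format or API; `Step`, `run_recipe` and the sample columns are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict[str, str]


@dataclass(frozen=True)
class Step:
    """One logged action: a description, the column it touches, the change."""
    description: str
    column: str
    transform: Callable[[str], str]


def run_recipe(steps: list[Step], records: list[Record]) -> list[Record]:
    """Replay every step, in order, on a copy of each record."""
    result = []
    for record in records:
        cleaned = dict(record)  # leave the input record untouched
        for step in steps:
            cleaned[step.column] = step.transform(cleaned[step.column])
        result.append(cleaned)
    return result


# A recipe as a business user might build it interactively (hypothetical steps).
recipe = [
    Step("trim whitespace", "name", str.strip),
    Step("title-case names", "name", str.title),
    Step("upper-case country code", "country", str.upper),
]

raw = [{"name": "  timm grosser ", "country": "de"}]
prepared = run_recipe(recipe, raw)
```

Because the recipe is plain data plus functions, the same list of steps can be run again by a pipeline on every new batch, which is the essence of handing a preparation over to the technical developer.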


Figure 2: Interface for Talend Data Preparation

Content is exchanged via the shared repository without the user having to do anything. For easy integration, Talend Studio has its own data preparation components that can be added via drag and drop. This also means that the actions of business users can be deployed exclusively through Talend Studio. This is what makes Talend Data Preparation different from other data preparation tools, which individual users can run as silo applications. Talend provides interfaces to involve business users, focusing their input on data quality logic to ensure consistent and traceable data preparation. The recommended approach does not aim for business users to process data autonomously in an iterative and experimental way. Instead, it involves them in order to raise data quality while ensuring governance. However, the tools can also be used for ad hoc preparations if required.

Talend Data Stewardship is used when individual data records have to be manually prepared, corrected or enriched. Accordingly, each individual data record forms a task in the product, which is to be resolved using a predefined workflow. Talend Data Stewardship allows users to create different roles and to design their own workflows, such as approval processes or feedback loops. The product supports data stewards in managing their tasks and in enriching or correcting the data sets.


Figure 3: Interface for Talend Data Stewardship

Rules from Talend Data Preparation can be used here, thanks to the shared repository. Here too, Talend Studio is used to provide the data, as well as to reintegrate the prepared data.
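The task-per-record model with a predefined workflow can be pictured as a small state machine. The sketch below is an assumption-laden illustration, not Talend Data Stewardship's actual workflow engine; the states, the `ALLOWED` transitions and the `Task` class are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    NEW = "new"
    IN_REVIEW = "in review"
    APPROVED = "approved"
    REJECTED = "rejected"


# Hypothetical approval workflow with a feedback loop:
# rejected tasks go back to NEW, approved tasks are final.
ALLOWED = {
    State.NEW: {State.IN_REVIEW},
    State.IN_REVIEW: {State.APPROVED, State.REJECTED},
    State.REJECTED: {State.NEW},
    State.APPROVED: set(),
}


@dataclass
class Task:
    """One data record that a steward has to resolve."""
    record: dict[str, str]
    state: State = State.NEW
    history: list[State] = field(default_factory=lambda: [State.NEW])

    def move_to(self, target: State) -> None:
        """Advance the task, refusing transitions the workflow does not allow."""
        if target not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {target.value} is not allowed")
        self.state = target
        self.history.append(target)


task = Task({"customer": "ACME", "city": ""})
task.move_to(State.IN_REVIEW)
task.record["city"] = "Paris"  # manual enrichment by the data steward
task.move_to(State.APPROVED)
```

Keeping the history on the task is one simple way to make manual corrections traceable, which matches the governance focus described above.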

Data Catalog

The data catalog focuses on metadata. It should not be confused with Talend Inventory, a separate product that is closely tied to the management of Talend Cloud artifacts. Talend Data Catalog, by contrast, is an open repository that can be operated on-premises or in the cloud. In the latter case, it is known as Talend Cloud Data Catalog. Being open means that, in addition to Talend metadata, metadata from data storage technologies, data integration tools, front-end tools and more can be integrated. For this purpose, the catalog offers its own connectors as well as profiling functions, for example to collect statistical metadata. Technical metadata can be read automatically if the connector allows it. For manual maintenance of business metadata, there is a Business Glossary. Typical functions such as search and lineage are supported. Functions for collaboration are available but rather rudimentary, as is the option to add policies, which can be defined as warnings and recommendations for metadata objects.


Figure 4: Data Flow in the Talend Data Catalog

Analyst opinion

With Data Fabric, Talend sets and follows the market trend towards cloud-based data fabrics. They provide a foundation along which data and use cases can scale, making the use of data as simple and valuable as possible. I am convinced that data management for analytics cannot succeed without the involvement of business departments. I would even go a step further and say that, in my opinion, at least control over the content of data management belongs to the business department, while efficient implementation and operation lie with IT. Business departments must take responsibility for data. Be that as it may, being able to use data means making it usable, and this is where business users are called upon. Talend has recognized this and has come up with a special approach. They integrate business users into the IT-related data management processes (e.g., creation of ETL paths, loading of a data lake) through special interfaces that enable interaction with business departments. I like the approach that everything is done under Talend’s control, thus meeting the goal of “governance”. On the other hand, I’m afraid that this approach limits the flexibility desired by business departments because the data is ultimately processed through controlled processes rather than self-service. Both points of view have their justification. I am curious to see how well Talend fares with its strategy, which is more concerned with avoiding data chaos and establishing governance than with allowing business users to handle data autonomously and flexibly. But the latter also requires a certain maturity and governance in terms of enterprise organization.

Overall, I have the impression that Talend has made progress in terms of functionality and usability. As far as the data catalog is concerned, I think Talend is on the right track but there is still room for improvement. It is part of Talend’s vision to “take the work out of working with data”. That also means we will see more automation in Data Fabric – thanks to ML. I’m curious about that too.