
Data Governance with Apache Atlas: Atlas in Cloudera Data Platform (Part 2 of 3)

In the first article in this series, we learned about Apache’s open-source offering for data governance, Apache Atlas, and took a detailed look at its architecture, capabilities, and web user interface. Nowadays, Apache Atlas is included in the offerings of many tech vendors, and one such vendor is the big data platform leader, Cloudera. In this second part of the series, we’ll see exactly how Atlas is integrated with Cloudera Data Platform (CDP), and how it provides connected governance when enabled with other Cloudera services.

 

 

Apache Atlas in Cloudera

 

Apache Atlas originated with Hortonworks Data Platform (HDP), and after the merger of Cloudera and Hortonworks, it was chosen as the de facto governance tool for CDP, the new data platform combining HDP and CDH (the old Cloudera Distribution of Hadoop).

 

Within CDP, Apache Atlas is a standard metadata store designed to exchange metadata inside and outside the Hadoop stack, providing scalable governance for metadata-driven organisations and enterprises. It can be installed like any other CDP service, via the Cloudera Manager Admin Console, and requires HBase, Kafka, and Solr as dependent services (we’ll see why in the next section).

 


Figure 1: Atlas in Cloudera Manager
(https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/atlas-configuring/topics/atlas-configure-status.html)

 

Atlas also supports running multiple instances of the service in an active-passive configuration. Within Hadoop, Atlas provides hooks with command-line interfaces that load metadata from HBase, Hive, Impala, NiFi, Spark, Sqoop, and Kafka. Beyond these, Atlas also supports building custom hooks or, in simple terms, connectors (we’ll be going into this in more depth in the next post in this series).

 

Finally, the close integration of Atlas with Apache Ranger also enables you to define, administer, and manage security and compliance policies consistently, across all the Hadoop stack components.

 

 

Metadata in Cloudera

 


Figure 2: How Atlas loads metadata in Cloudera
(https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/cdp-governance-overview/topics/atlas-overview.html)

 

In Cloudera, many data processing and storage services come with Atlas add-ons (the hooks) that enable them to publish activity metadata to a Kafka topic called ATLAS_HOOK. When an action occurs in a Cloudera service instance (for example, the creation of a new table in Hive), the corresponding Atlas hook gathers the relevant information, populates a metadata entity, and publishes it as a message on the ATLAS_HOOK Kafka topic.
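For illustration, here is a simplified sketch of the kind of JSON notification a hook publishes on the ATLAS_HOOK topic; the exact envelope varies by Atlas version, and the entity names, user, and cluster name below are hypothetical:

{
  "version": { "version": "1.0.0" },
  "msgCreationTime": 1658240000000,
  "message": {
    "type": "ENTITY_CREATE_V2",
    "user": "hive",
    "entities": {
      "entities": [
        {
          "typeName": "hive_table",
          "attributes": {
            "qualifiedName": "my_db.my_table@cm",
            "name": "my_table",
            "owner": "hive"
          }
        }
      ]
    }
  }
}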

 

When analysing this message, Atlas determines which information should lead to creating new entities and which should update existing ones. Atlas then creates or updates the appropriate entities and resolves lineage from existing entities to the new ones.

 


Figure 3: How metadata is updated via the ATLAS_HOOK Kafka topic

 

In addition, applications interested in metadata changes can monitor messages on the ATLAS_ENTITIES Kafka topic.
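As a minimal sketch of such a consumer, assuming the kafka-python client, a hypothetical broker address, and an unsecured cluster (a real CDP cluster would additionally need Kerberos/TLS settings):

# pip install kafka-python
import json

from kafka import KafkaConsumer

# Subscribe to Atlas's outbound notification topic.
consumer = KafkaConsumer(
    "ATLAS_ENTITIES",
    bootstrap_servers="broker-1.example.com:9092",  # hypothetical broker
    group_id="metadata-watcher",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    message = record.value.get("message", {})
    # Event types include entity create/update/delete notifications.
    print(message.get("type"), message.get("user"))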

 

Internally, Atlas consumes these messages and models the relationships between entities, storing them both in JanusGraph (backed by an HBase database) and as search indexes in Solr.

 

Cloudera provides a set of predefined Apache Atlas entity types that are used to collect metadata. The table below lists the available Atlas types for each source service when Atlas is part of the deployment (after the table, you’ll find a short sketch of how to check which types are registered in your own cluster):

 

| Source | Actions Acknowledged | Entities Created/Updated |
| --- | --- | --- |
| HiveServer | ALTER DATABASE, CREATE DATABASE, DROP DATABASE | hive_db, hive_db_ddl |
| | ALTER TABLE, CREATE TABLE, CREATE TABLE AS SELECT, DROP TABLE | hive_process, hive_process_execution, hive_table, hive_table_ddl, hive_column, hive_column_lineage, hive_storagedesc, hdfs_path |
| | ALTER VIEW, ALTERVIEW_AS_SELECT, CREATE VIEW, CREATE VIEW AS SELECT, DROP VIEW | hive_process, hive_process_execution, hive_table, hive_column, hive_column_lineage, hive_table_ddl |
| | INSERT INTO (SELECT), INSERT OVERWRITE | hive_process, hive_process_execution |
| HBase | alter_async | hbase_namespace, hbase_table, hbase_column_family |
| | create_namespace, alter_namespace, drop_namespace | hbase_namespace |
| | create table, alter table, drop table, drop_all tables | hbase_table, hbase_column_family |
| | alter table (create column family), alter table (alter column family), alter table (delete column family) | hbase_table, hbase_column_family |
| Impala* | CREATETABLE_AS_SELECT | impala_process, impala_process_execution, impala_column_lineage, hive_db, hive_table_ddl |
| | CREATEVIEW | impala_process, impala_process_execution, impala_column_lineage, hive_table_ddl |
| | ALTERVIEW_AS_SELECT | impala_process, impala_process_execution, impala_column_lineage, hive_table_ddl |
| | INSERT INTO, INSERT OVERWRITE | impala_process, impala_process_execution |
| Spark* | CREATE TABLE USING, CREATE TABLE AS SELECT, CREATE TABLE USING … AS SELECT | spark_process |
| | CREATE VIEW AS SELECT | spark_process |
| | INSERT INTO (SELECT), LOAD DATA [LOCAL] INPATH | spark_process |
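To check which of these entity types are actually registered in your own cluster, you can query the Atlas REST API. A minimal sketch with the requests library, where the host, port, and credentials are placeholders:

# pip install requests
import requests

ATLAS = "http://atlas-host.example.com:31000"  # hypothetical Atlas endpoint
AUTH = ("admin", "admin")  # placeholder credentials

# List all entity type definitions registered in Atlas.
resp = requests.get(
    f"{ATLAS}/api/atlas/v2/types/typedefs",
    params={"type": "entity"},
    auth=AUTH,
)
resp.raise_for_status()

for entity_def in resp.json()["entityDefs"]:
    print(entity_def["name"])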

 

 

Atlas in Action

 

We’ve talked about how Atlas hooks (or bridges) load metadata from various services, so now let’s run through an example with a Hive database in CDP.

 

Every time we create a Hive table in Cloudera Data Platform, its attributes are loaded into the Atlas entity type hive_table. Here’s an example of the default hive_table structural attributes:

 

Attributes: 
    name:             string 
    db:               hive_db 
    owner:            string 
    createTime:       date 
    lastAccessTime:   date 
    comment:          string 
    retention:        int 
    sd:               hive_storagedesc 
    partitionKeys:    array 
    aliases:          array 
    columns:          array 
    parameters:       map<string,string> 
    viewOriginalText: string 
    viewExpandedText: string 
    tableType:        string 
    temporary:        boolean 
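If you want to verify this attribute list against a live cluster, the hive_table type definition can be fetched over the Atlas REST API; a sketch, with placeholder endpoint and credentials as before:

import requests

ATLAS = "http://atlas-host.example.com:31000"  # hypothetical Atlas endpoint
AUTH = ("admin", "admin")  # placeholder credentials

# Retrieve the entity type definition for hive_table.
resp = requests.get(f"{ATLAS}/api/atlas/v2/types/entitydef/name/hive_table", auth=AUTH)
resp.raise_for_status()

for attr in resp.json()["attributeDefs"]:
    print(attr["name"], attr["typeName"])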

 

Let’s create a Hive database in our Cloudera Hadoop platform using the command below:

 

CREATE DATABASE atlas_test_db; 

 

After its successful creation, we’ll see that a new entity of type hive_db has also been created in Apache Atlas. Note how it also contains additional information about this Hive database that we never provided:

 


Figure 4: List of attributes captured for hive_db entity in Cloudera
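We can also confirm the new entity programmatically with Atlas’s basic search API; a sketch, using the same placeholder endpoint and credentials as in the earlier snippets:

import requests

ATLAS = "http://atlas-host.example.com:31000"  # hypothetical Atlas endpoint
AUTH = ("admin", "admin")  # placeholder credentials

# Basic search: find hive_db entities matching our database name.
resp = requests.get(
    f"{ATLAS}/api/atlas/v2/search/basic",
    params={"typeName": "hive_db", "query": "atlas_test_db"},
    auth=AUTH,
)
resp.raise_for_status()

for ent in resp.json().get("entities", []):
    print(ent["guid"], ent.get("attributes", {}).get("qualifiedName"))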

 

The information is extracted from the underlying system and stored as metadata in Apache Atlas. It tells us more about this database, such as where it is stored on disk, who owns it, when it was created, and so on:

 


Figure 5: Audits captured for hive_db entity in Cloudera

 

Now we’re going to create a table in the above database using the script shown below. Once again, the information in this table is unimportant; our real interest lies in the metadata associated with this table in Apache Atlas:

 

CREATE EXTERNAL TABLE AtlasTestHive (
    id STRING,
    name STRING,
    group_name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/testAtlas'
TBLPROPERTIES ("skip.header.line.count"="1");

 

A relationship pointing to the parent Hive database is also built after the creation of the table, while an audit log simultaneously captures entity creation, update, and delete events:

 


Figure 6: Table audits captured for hive_db entity in Cloudera

 

We created three fields: id, name, and group_name, each representing a string data type. Soon after the table was built in Hive, an entity of the type hive_table was also created in Atlas, collecting a good amount of information on the table. A unique qualified name is assigned to this entity. In addition, not only do we see the columns (of type hive_column) and DB name (of type hive_db), but also information related to the creation time of the table, when it was accessed, who is the owner, what type of table it is, etc.

 


Figures 7 and 8: Entity of type hive_table
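Since the qualified name is unique, the entity can also be fetched directly by it. In CDP, the qualified name of a Hive table typically follows the pattern <db>.<table>@<cluster>, so for a hypothetical cluster named cm a sketch could look like this:

import requests

ATLAS = "http://atlas-host.example.com:31000"  # hypothetical Atlas endpoint
AUTH = ("admin", "admin")  # placeholder credentials

# Look up the table entity by its unique qualifiedName attribute.
resp = requests.get(
    f"{ATLAS}/api/atlas/v2/entity/uniqueAttribute/type/hive_table",
    params={"attr:qualifiedName": "atlas_test_db.atlastesthive@cm"},  # assumed cluster name
    auth=AUTH,
)
resp.raise_for_status()

entity = resp.json()["entity"]
print(entity["guid"], entity["attributes"]["owner"], entity["attributes"]["tableType"])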

 

Now, let’s wrap up the overall process: we had a location in HDFS (/tmp/testAtlas), from which we created an external Hive table called “atlastesthive”. The complete lineage is automatically created by Atlas and can be viewed in the Lineage tab of that entity:

 


Figure 9: Data lineage for the AtlasTestHive table
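The same lineage graph is also exposed through the REST API: given the entity’s GUID (for instance, from the lookup above), a sketch like the following returns the upstream and downstream entities and processes (endpoint and credentials are placeholders):

import requests

ATLAS = "http://atlas-host.example.com:31000"  # hypothetical Atlas endpoint
AUTH = ("admin", "admin")  # placeholder credentials

guid = "0fe92820-..."  # placeholder: GUID of the AtlasTestHive table entity

# Fetch lineage in both directions, up to three hops away.
resp = requests.get(
    f"{ATLAS}/api/atlas/v2/lineage/{guid}",
    params={"direction": "BOTH", "depth": 3},
    auth=AUTH,
)
resp.raise_for_status()

for node in resp.json()["guidEntityMap"].values():
    print(node["typeName"], node["displayText"])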

 

Similarly, there are several other tabs available: in the Relationships tab, we can see column relationships, database relationships, process relationships, storage descriptions, etc.:

 


Figure 10: Relationship details of the AtlasTestHive table

 

However, some of this information is repeated in other tabs, like the Schema tab, which also has details about columns:

 


Figure 11: Schema details of the AtlasTestHive table

 

Now, let’s click on the “id” column: we can see all the information related to this column, captured as a new entity of type hive_column:

 


Figure 12: Hive column (id) metadata details

 

To summarise, the overall process gives us information about three important entity types: hive_db, hive_table, and hive_column. The same applies to whichever service does the processing: for example, if we used a Kafka topic in a data pipeline, the entity type kafka_topic would capture metadata about it in Apache Atlas. In Cloudera, Atlas provides entity types for almost all the services included in the platform, making CDP really convenient for presenting an organisation’s connected governance efforts.

 


Figure 13: Overall entity types utilised in the example

 

 

Next Steps

 

In this article, we’ve looked at Atlas in the context of Cloudera Data Platform, walking through a data processing example that moves data from HDFS into Hive. We’ve also analysed the metadata model that Apache Atlas creates in the background for the overall process, and seen the different entity types that come built into Cloudera’s Atlas deployment.

 

In our next article, we will build a custom entity type and create hooks to load data and relationships in Apache Atlas.

 

If you have any questions about what you’ve read, or if you are simply interested in discovering more about Cloudera, Apache Atlas, or any other data tool for that matter – do not hesitate to contact us at ClearPeaks. Our certified experts will be more than happy to help and guide you along your journey through Business Intelligence, Big Data, Advanced Analytics, and more!

 


Saqib T
saqib.tamli@clearpeaks.com