My understanding, based on using the product and my own observations, is as follows:
What Waterline can offer –
Waterline Data uses its Smart Data Catalog to unlock the value of the data lake. Here is how:
1. Time to Value – Automatically catalog data assets across all your data through automated profiling.
2. Tribal knowledge sharing – Augment semantic discovery by crowdsourcing tribal data knowledge (metadata discovery).
3. Create Trust – Enable agile governance with automated tagging, data stewardship, and secure self-service access to data based on role and policy.
Waterline Data offerings Snapshot:
1) Cataloging
2) Find and Understand
3) Provision
4) Governance
Waterline Data in action –
1. (Server) Profile your data –
Crawling HDFS files to determine each file’s format, reading each HDFS file to extract field metadata and data-quality metrics, and inserting the metadata and data into Waterline Data Inventory’s repository.
· Using repository data to suggest tags on field data, based both on predetermined reference data and on data previously tagged by users (such as product codes or sales regions).
· Using repository data to find files that contain the same data.
· Profiled files show data quality metrics and sample data for each field.
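To make the profiling step more concrete, here is a minimal Python sketch of the idea. It is not Waterline's implementation or API: the function names (`profile_csv`, `overlapping_fields`), the type-inference rules, the Jaccard overlap measure, and the 0.8 threshold are all my own assumptions, and it reads CSV text in memory rather than crawling HDFS.

```python
import csv
import io
from collections import Counter


def infer_type(values):
    """Guess a field type from its non-empty string values (illustrative rules)."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "empty"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    return "string"


def profile_csv(text, top_n=3):
    """Return per-field metadata and simple data-quality metrics for CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    for field in (rows[0].keys() if rows else []):
        raw = [row.get(field) for row in rows]
        present = [v for v in raw if v not in (None, "")]
        profile[field] = {
            "type": infer_type(present),
            "row_count": len(raw),
            "null_count": len(raw) - len(present),
            "cardinality": len(set(present)),
            "most_frequent": Counter(present).most_common(top_n),
            "values": set(present),  # kept so files can be compared later
        }
    return profile


def overlapping_fields(profile_a, profile_b, threshold=0.8):
    """Find field pairs whose distinct values largely coincide (Jaccard overlap)."""
    matches = []
    for name_a, a in profile_a.items():
        for name_b, b in profile_b.items():
            union = a["values"] | b["values"]
            if not union:
                continue
            score = len(a["values"] & b["values"]) / len(union)
            if score >= threshold:
                matches.append((name_a, name_b, round(score, 2)))
    return matches


# Hypothetical sample data: the second file copies two columns of the first.
orders = "order_id,region,amount\n1,EMEA,10.5\n2,APAC,\n3,EMEA,7\n"
regions_copy = "id,sales_region\n1,EMEA\n2,APAC\n3,EMEA\n"

orders_profile = profile_csv(orders)
copy_profile = profile_csv(regions_copy)
shared = overlapping_fields(orders_profile, copy_profile)
```

Here `orders_profile["amount"]` records one missing value and a decimal type, and `shared` pairs up the fields that hold the same data even though their names differ, which is the kind of signal a catalog could use to flag files that contain the same data.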
2. (UI and Server) Mark landings and run lineage discovery –
A key feature of Waterline Data Inventory is its ability to discover and display relationships among files, such as files that are duplicates of each other or files that contain copies of data from other files.
3. (UI and Server) Tag the data you know –
Now that users can see the wealth of file and field information, they can begin to annotate the data using “tags.” Tags give users a place to record knowledge about files and fields so other users have the benefit of that knowledge.
4. (UI) Leverage discovery results in searches –
In the sandbox sample data, go to Advanced Search and find the tag “Cuisine”. Typing a few letters in the Tags filter box brings up that tag, which is nested under “Food Service.” Select the tag and click Search.
5. (UI) Bookmark files you want to follow –
Want to know if this file changes or if a coworker has added tags to it? Bookmarking a file or folder allows you to jump right to the file from the Bookmark menu on the top of the Waterline Data Inventory screen.
6. (Server) Run jobs to keep up with new data and users’ tags –
As new data comes into your cluster, you’ll want to run Waterline Data Inventory profiling jobs to make the rich metadata for the new data available to users. In addition, you’ll want to run tagging jobs to make sure that tags added to fields are propagated to new and to existing data that matches the tagged data.
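The tag propagation described in the last step can be pictured roughly as follows. This is only a sketch under my own assumptions, not how Waterline's tagging jobs actually work: the function name `suggest_tags`, the overlap score (share of a new field's values already seen under a tag), and the 0.6 threshold are hypothetical.

```python
def suggest_tags(tagged_values, new_fields, threshold=0.6):
    """Suggest a tag for each new field based on value overlap with tagged data.

    tagged_values: {tag: set of values seen in fields users already tagged}
    new_fields:    {field name: set of values in a newly profiled field}
    """
    suggestions = {}
    for field, values in new_fields.items():
        if not values:
            continue
        best_tag, best_score = None, 0.0
        for tag, known in tagged_values.items():
            # Share of this field's distinct values already seen under the tag.
            score = len(values & known) / len(values)
            if score > best_score:
                best_tag, best_score = tag, score
        if best_tag is not None and best_score >= threshold:
            suggestions[field] = (best_tag, round(best_score, 2))
    return suggestions


# Hypothetical values: two tags curated by users, three newly profiled fields.
tagged_values = {
    "Cuisine": {"thai", "italian", "mexican", "indian"},
    "Sales Region": {"EMEA", "APAC", "AMER"},
}
new_fields = {
    "food_type": {"thai", "italian", "sushi"},
    "territory": {"EMEA", "APAC"},
    "notes": {"call back"},
}

suggested = suggest_tags(tagged_values, new_fields)
```

In this example `food_type` and `territory` pick up suggested tags because most of their values match data that users tagged earlier, while `notes` stays untagged. A real job would presumably also weigh the reference data and let data stewards accept or reject each suggestion.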
Waterline Data works on top of Atlas, so it offers everything that Atlas does, plus the additional functionality described above.
Note – the prerequisites for a Waterline Data installation are Atlas (graph DB: Titan backed by HBase; search: Solr/Elasticsearch), Ranger, and HCatalog.
A few other useful links –