hadoop Nitty Gritty Thing

My understanding, based on product use and observation, is as follows.

What Waterline can offer –

Waterline Data uses its Smart Data Catalog to unlock the value of a data lake. Here is how:

1. Time to Value – Automatically catalog (automated profiling) data assets across all the data.

2. Tribal knowledge sharing – Augment semantic discovery by crowdsourcing tribal data knowledge (metadata discovery).

3. Create Trust – Enable agile governance with automated tagging, data stewardship, and secure self-service access to data based on role and policy.

Waterline Data offerings Snapshot:

1) Cataloging

2) Find and Understand

3) Provision

4) Governance

Waterline Data in action –

1. (Server) Profile your data –

Crawling HDFS files to determine each file’s format, reading each HDFS file to extract field metadata and data-quality metrics, and inserting the metadata and data into Waterline Data Inventory’s repository.

· Using repository data to suggest tags on field data, based on both predetermined reference data and data previously tagged by users (such as product codes or sales regions).

· Using repository data to find files that contain the same data.

· Profiled files show data quality metrics and sample data for each field.

2. (UI and Server) Mark landings and run lineage discovery –

A key feature of Waterline Data Inventory is its ability to discover and display relationships among files, such as files that are duplicates of each other or files that contain copies of data from other files.

3. (UI and Server) Tag the data you know –

Now that users can see the wealth of file and field information, they can begin to annotate the data using “tags.” Tags give users a place to record knowledge about files and fields so other users have the benefit of that knowledge.

4. (UI) Leverage discovery results in searches –

In the sandbox sample data, go to Advanced Search and find the tag “Cuisine”. Typing a few letters in the Tags filter box brings up that tag, which is nested under “Food Service.” Select the tag and click Search.

5. (UI) Bookmark files you want to follow –

Want to know if this file changes or if a coworker has added tags to it? Bookmarking a file or folder allows you to jump right to the file from the Bookmark menu on the top of the Waterline Data Inventory screen.

6. (Server) Run jobs to keep up with new data and users’ tags –

As new data comes into your cluster, you’ll want to run Waterline Data Inventory profiling jobs to make the rich metadata for the new data available to users. In addition, you’ll want to run tagging jobs to make sure that tags added to fields are propagated to new and to existing data that matches the tagged data.
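
To make the profiling and tagging steps above a bit more concrete, here is a rough sketch in plain Python. This is not Waterline's code or API, just my own illustration of the idea: compute simple per-field metrics, then suggest tags for a new field when its values overlap with values that users tagged earlier. The function names (`profile_field`, `suggest_tags`), the 0.8 threshold, and the sample data are all made up for this example.

```python
from collections import Counter


def profile_field(values):
    """Return simple metadata and data-quality metrics for one field."""
    non_empty = [v for v in values if v not in ("", None)]
    numeric = 0
    for v in non_empty:
        try:
            float(v)
            numeric += 1
        except ValueError:
            pass
    all_numeric = bool(non_empty) and numeric == len(non_empty)
    return {
        "count": len(values),
        "empty": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "inferred_type": "number" if all_numeric else "string",
        "top_values": Counter(non_empty).most_common(3),
    }


def suggest_tags(field_values, tagged_reference, threshold=0.8):
    """Suggest tags whose known values cover most of the field's distinct values."""
    distinct = {v for v in field_values if v not in ("", None)}
    suggestions = []
    for tag, reference in tagged_reference.items():
        if distinct and len(distinct & reference) / len(distinct) >= threshold:
            suggestions.append(tag)
    return sorted(suggestions)


# Hypothetical newly arrived file with two fields (all values read as text).
new_file = {
    "region": ["EMEA", "APAC", "EMEA", "", "AMER"],
    "amount": ["10.5", "7", "3.25", "8", "12"],
}

# Hypothetical tags added earlier by users, with the values seen under each tag.
tagged_reference = {
    "Sales Region": {"EMEA", "APAC", "AMER", "LATAM"},
    "Product Code": {"P-100", "P-200"},
}

profiles = {name: profile_field(vals) for name, vals in new_file.items()}
tags = {name: suggest_tags(vals, tagged_reference) for name, vals in new_file.items()}
```

In this toy run the "region" field picks up the "Sales Region" tag because all of its distinct values were already seen under that tag, while "amount" gets no suggestion. A real catalog would obviously use far richer matching than a set overlap.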

Waterline Data works on top of Atlas, so it offers everything that Atlas does, plus the additional functionality described above.

Note – the prerequisites for a Waterline Data installation are Atlas (graph DB: Titan backed by HBase; search: Solr/Elasticsearch), Ranger, and HCatalog.

A few other useful links –

White Paper - http://go.waterlinedata.com/hw-mda
Smart Data Catalog 3.0 - http://www.waterlinedata.com/press_releases/waterline-dataunveils-universal-data-catalogto-empower-citizen-data-scientists/
Installation and Administration Guide - https://s3-us-west-1.amazonaws.com/wld-product-downloads/docs/docsv125/WaterlineDataInventory-InstallandAdminGuide.pdf
Infosys invested in Waterline Data - https://www.infosys.com/newsroom/press-releases/Pages/invests-waterline-data-science.aspx

The WITH clause may be processed as an inline view or resolved as a temporary table.
Consider that for each employee we want to know how many people are in their department. Using an inline view, we might do the following.

SELECT e.ename AS employee_name,
       dc.dept_count AS emp_dept_count
FROM   emp e,
       (SELECT deptno, COUNT(*) AS dept_count
        FROM   emp
        GROUP BY deptno) dc
WHERE  e.deptno = dc.deptno;

Using a WITH clause this would look like the following.

WITH dept_count AS (
  SELECT deptno, COUNT(*) AS dept_count
  FROM   emp
  GROUP BY deptno)
SELECT e.ename AS employee_name,
       dc.dept_count AS emp_dept_count
FROM   emp e,
       dept_count dc
WHERE  e.deptno = dc.deptno;

The difference seems rather insignificant here.
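
If you want to convince yourself that the two forms return the same rows without spinning up a cluster, a quick check like the one below works. Note the assumptions: it uses Python's built-in sqlite3 module as a stand-in for Hive (SQLite also supports WITH), the emp rows are made up, and I added an ORDER BY so the two result sets can be compared directly.

```python
import sqlite3

# In-memory SQLite database standing in for Hive; the emp rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, deptno INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("SMITH", 20), ("ALLEN", 30), ("WARD", 30), ("JONES", 20), ("KING", 10)],
)

# Inline-view form of the query.
inline_view_sql = """
SELECT e.ename AS employee_name,
       dc.dept_count AS emp_dept_count
FROM   emp e,
       (SELECT deptno, COUNT(*) AS dept_count
        FROM   emp
        GROUP BY deptno) dc
WHERE  e.deptno = dc.deptno
ORDER BY e.ename
"""

# WITH-clause form of the same query.
with_clause_sql = """
WITH dept_count AS (
  SELECT deptno, COUNT(*) AS dept_count
  FROM   emp
  GROUP BY deptno)
SELECT e.ename AS employee_name,
       dc.dept_count AS emp_dept_count
FROM   emp e,
       dept_count dc
WHERE  e.deptno = dc.deptno
ORDER BY e.ename
"""

inline_rows = conn.execute(inline_view_sql).fetchall()
with_rows = conn.execute(with_clause_sql).fetchall()

for name, dept_count in with_rows:
    print(name, dept_count)
```

Both queries give identical rows on this sample. The count includes the employee, since COUNT(*) counts every row in the department.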

Saturday, November 19, 2016

How to ssh into HortonWorks Sandbox using Putty or other Client

Thursday, October 27, 2016

What Waterline Data an Smart Data Catalog can do for DataLake !!

Wednesday, September 21, 2016

Hive WITH Clause - Subquery another way