Microsoft DP-200 / 201 links by David Papkin

This David Papkin page contains links for the Microsoft Azure DP-200 course.

Azure Batch

Use Azure Batch to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently in Azure. Azure Batch creates and manages a pool of compute nodes (virtual machines), installs the applications you want to run, and schedules jobs to run on the nodes.
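
As a rough local analogy only (this is not the Azure Batch SDK), the pool-and-scheduler idea can be sketched in Python, with a thread pool standing in for the pool of compute nodes. The names render_frame, run_job, and pool_size are hypothetical and chosen just for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def render_frame(frame_number: int) -> str:
    """Hypothetical task: stands in for one unit of work in a batch job."""
    return f"frame-{frame_number:03d}.png"


def run_job(task_inputs, pool_size: int = 4):
    """Schedule every task onto a fixed-size pool; results come back in input order."""
    # The executor plays the role of the pool of compute nodes;
    # map() hands each task to whichever worker is free.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(render_frame, task_inputs))


if __name__ == "__main__":
    print(run_job(range(3)))
```

In the real service the workers are virtual machines and the tasks are command lines, but the shape is similar: define the pool once, submit many independent tasks, and let the scheduler place them.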

Azure Batch Overview

Azure Bot

Bots provide an experience that feels less like using a computer and more like dealing with a person – or at least an intelligent robot.

About Azure Bot Service

Quickstart – Create a bot with Azure Bot Service

Azure Cosmos DB

Azure Cosmos DB documentation

Azure Cosmos DB introduction

Azure Cosmos DB Quickstart

Azure Cosmos DB vs Azure SQL

Calculating Cosmos DB Request Units (RU) for CRUD and Queries

CosmosDB Capacity Calculator

Choose the appropriate API for Azure Cosmos DB


Different types of Databases

Document store – document-oriented database systems are characterized by their schema-free organization of data.


MongoDB is a document database, which means it stores data in JSON-like documents. We believe this is the most natural way to think about data, and is much more expressive and powerful than the traditional row/column model.
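
To make the schema-free idea concrete, here is a minimal sketch in plain Python (no MongoDB driver involved). The sample documents and the find helper are made up for illustration; the point is only that two documents in one collection need not share the same fields.

```python
import json

# Two documents in the same collection with different fields:
# nothing forces them to share a schema.
collection = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Lin", "tags": ["azure", "dp-200"], "address": {"city": "Oslo"}},
]


def find(docs, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


if __name__ == "__main__":
    print(json.dumps(find(collection, name="Lin"), indent=2))
```
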
Graph DBMS – represent data in graph structures as nodes and edges, which are relationships between nodes. They allow easy processing of data in that form, and simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node.
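
The "number of steps from one node to another" calculation mentioned above can be sketched with a breadth-first search over an adjacency list. This is a plain-Python illustration, not how any particular graph DBMS implements it; the graph data and the steps_between name are hypothetical.

```python
from collections import deque

# Hypothetical graph: each node maps to the nodes it has an edge to.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def steps_between(edges, start, goal):
    """Breadth-first search: fewest edges from start to goal, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


if __name__ == "__main__":
    print(steps_between(graph, "alice", "erin"))
```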

The Benefits of Graph Computing (Apache Tinkerpop)

Example : Gremlin

Gremlin is the graph traversal language of Apache TinkerPop. Gremlin is a functional, data-flow language that enables users to succinctly express complex traversals on (or queries of) their application’s property graph. It is developed by Apache TinkerPop of the Apache Software Foundation.

Key-value store – the simplest form of database management system. Key-value stores can only store pairs of keys and values, as well as retrieve values when a key is known.

These simple systems are normally not adequate for complex applications. On the other hand, it is exactly this simplicity that makes such systems attractive in certain circumstances. For example, resource-efficient key-value stores are often used in embedded systems or as high-performance in-process databases.
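
A minimal sketch of an in-process key-value store in Python, assuming nothing beyond a dictionary underneath. The class and method names are hypothetical; the thing to notice is that every operation goes through a key, and there is no way to query by value.

```python
class KeyValueStore:
    """Minimal in-process key-value store: values are reachable only by their key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed, False otherwise."""
        if key in self._data:
            del self._data[key]
            return True
        return False


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("session:42", {"user": "ada"})
    print(store.get("session:42"))
```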

Wide column store (column base) – store data in records with an ability to hold very large numbers of dynamic columns. Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide column stores can be seen as two-dimensional key-value stores.
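
The "two-dimensional key-value store" view can be sketched as a mapping from row key to a second mapping from column name to value. This is a toy model in Python with hypothetical names, not the storage layout of any real wide column store; it only shows that each row can carry its own, different set of columns.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy wide column table: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Use .get on the outer mapping so a lookup never creates an empty row.
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """Column names present in one row, sorted for stable output."""
        return sorted(self._rows.get(row_key, {}))


if __name__ == "__main__":
    table = WideColumnTable()
    table.put("user:1", "name", "Ada")
    table.put("user:1", "login:2020-01-05", "ok")
    table.put("user:2", "country", "NO")
    print(table.columns("user:1"), table.columns("user:2"))
```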

Cassandra – an open-source, distributed, wide column store NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Azure Cosmos DB – a fully managed database service with turnkey global distribution and transparent multi-master replication.

Azure SQL – relational DB

Azure Databricks

What is Azure Databricks?

Azure Databricks is the latest Azure offering for data engineering and data science. Databricks’ greatest strengths are its zero-management cloud solution and the collaborative, interactive environment it provides in the form of notebooks.

Databricks is powered by Apache Spark and offers an API layer where a wide range of analytics languages can be used to work as comfortably as possible with your data: R, SQL, Python, Scala and Java. The Spark ecosystem also offers a variety of perks such as Streaming, MLlib, and GraphX.

Data can be gathered from a variety of sources, such as Blob Storage, ADLS, and from ODBC databases using Sqoop.

Tutorial: Extract, transform, and load data by using Azure Databricks

These notebooks show how to convert JSON data to Delta Lake format, create a Delta table, append to the table, optimize the resulting table, and finally use Delta Lake metadata commands to show the table history, format, and details.

Manage Notebooks

You can manage notebooks using the UI, the CLI, and by invoking the Workspace API. This topic focuses on performing notebook tasks using the UI. For the other methods, see Databricks CLI and Workspace API.

Introductory Notebooks

Connecting to SQL Databases using JDBC

JDBC – Introduction

JDBC To Other Databases

Read and Write Apache Parquet file in Spark

How to view Apache Parquet file in Windows?

Azure Data Catalog

What is Azure Data Catalog?

Azure Data Catalog documentation

Azure Data Factory

Introduction to Azure Data Factory

Azure Data Factory

Create Azure Data Factory from Cloudshell

Azure Data Lake

Azure Data Lake is an on-demand, scalable, cloud-based storage and analytics service. It can be divided into two connected services, Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). ADLS is a cloud-based file system that allows the storage of any type of data with any structure, making it ideal for the analysis and processing of unstructured data.

Azure Data Lake Analytics

Azure Data Lake Analytics is a distributed, parallel job platform that runs U-SQL scripts in the cloud. The syntax is based on SQL with a twist of C#, a general-purpose programming language that Microsoft first released in 2002.

Azure Data Lake Storage

Azure Data Lake Storage Gen1 (previously known as Azure Data Lake Store) is an enterprise-wide hyper-scale repository for big data analytics workloads. Data Lake Storage Gen1 lets you capture data of any size, type, and ingestion speed. The data is captured in a single place for operational and exploratory analytics.

Data Lake Storage Gen2 is the result of converging the capabilities of Microsoft's two existing storage services, Azure Blob storage and Azure Data Lake Storage Gen1.

Azure Data Lake Storage Docs

Quickstart: Analyze data in Azure Data Lake Storage Gen2 by using Azure Databricks

What is Apache Hadoop?

Azure HDInsight

Azure HDInsight is a cloud service that allows cost-effective data processing using open-source frameworks such as Hadoop, Spark, Hive, Storm, and Kafka, among others.

Using Apache Sqoop, we can import and export data to and from a multitude of sources, but the native file system that HDInsight uses is either Azure Data Lake Store or Azure Blob Storage.

What is Azure HDInsight?

Azure HDInsight documentation

Cluster types in HDI

Cluster type – Description
Apache Hadoop – A framework that uses HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel.
Apache Spark – An open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications. See What is Apache Spark in HDInsight?
Apache HBase – A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data–potentially billions of rows times millions of columns. See What is HBase on HDInsight?
ML Services – A server for hosting and managing parallel, distributed R processes. It provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. See Overview of ML Services on HDInsight.
Apache Storm – A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
Apache Interactive Query – In-memory caching for interactive and faster Hive queries. See Use Interactive Query in HDInsight.
Apache Kafka – An open-source platform that’s used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. See Introduction to Apache Kafka on HDInsight.
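
The MapReduce programming model named in the Apache Hadoop entry can be sketched in a few lines of Python. This is a single-process word count with hypothetical function names, not Hadoop code: on a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```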

Azure SQL Data Warehouse

What is Azure SQL Data Warehouse?

SQL Data Warehouse Documentation

Quickstart: Create and query an Azure SQL Data Warehouse in the Azure portal

What is Polybase

Polybase Tutorial

Azure Stream Analytics

What is Azure Stream Analytics?

Azure Stream Analytics documentation

Azure Event Hub Stream Analytics and Power BI demo (Lab 6 concepts)

Comparison of Databricks vs HDInsight vs Data Lake Analytics

Cloud Analytics on Azure: Databricks vs HDInsight vs Data Lake Analytics

Updated DP-200 Labs

Demo video useful for Lab6b

Azure Event Hub Stream Analytics and Power BI Demo

Extract Lab6B into E:\Allfiles\Labfiles\Starter\DP-200.6 folder.

New DP-200 Lab 6B

Extract Lab7 into E:\Allfiles\Instructions folder.

Updated DP-200 Lab 7


Lambda Architecture implementation using Microsoft Azure

Big data architectures

Azure Cosmos DB: Implement a lambda architecture on the Azure platform

Databricks Lambda Architecture


Dynamic Data Masking

AXA feedback

End of David Papkin page containing links on Microsoft Azure DP-200 course.

Helpful Azure learning links

Microsoft Azure Forums – The Azure forums are very active. You can search the threads for a
specific area of interest. You can also browse categories like Azure Storage, Pricing
and Billing, Azure Virtual Machines, and Azure Migrate.

Azure Architecture Center – Gain access to the Azure Application Architecture Guide,
Azure Reference Architectures, and the Cloud Design Patterns.

Microsoft Learning Community Blog – Get the latest information about the certification
tests and exam study groups.

Channel 9 provides a wealth of informational videos, shows, and

Azure Tuesdays With Corey – Corey Sanders answers your questions about
Microsoft Azure: Virtual Machines, Web Sites, Mobile Services, Dev/Test, etc.

Azure Fridays – Join Scott Hanselman as he engages one-on-one with the engineers
who build the services that power Microsoft Azure as they demo capabilities,
answer Scott’s questions, and share their insights.

Microsoft Azure Blog – Keep current on what’s happening in Azure, including what’s
now in preview, generally available, news & updates, and more.

End of David Papkin Microsoft Azure page.
