This David Papkin page contains links for the Microsoft Azure DP-200 course.
Use Azure Batch to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently in Azure. Azure Batch creates and manages a pool of compute nodes (virtual machines), installs the applications you want to run, and schedules jobs to run on the nodes.
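As a rough local analogy only (this is not the Azure Batch SDK), the pool-and-scheduler idea can be sketched with Python's standard library. A fixed pool of worker threads stands in for the compute nodes, and each submitted call stands in for one task. The names `POOL_SIZE`, `simulate_task`, and `run_job` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4  # hypothetical stand-in for the number of compute nodes


def simulate_task(task_id: int) -> int:
    """Stand-in for one unit of batch work: sum the squares below task_id."""
    return sum(i * i for i in range(task_id))


def run_job(task_ids):
    """Schedule every task onto the worker pool and collect results in order."""
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # map() hands tasks to free workers and yields results in input order.
        return list(pool.map(simulate_task, task_ids))


if __name__ == "__main__":
    print(run_job(range(5)))
```

In the real service the pool is made of virtual machines and the tasks are command lines or applications, but the division of labour between pool, job, and task is similar.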
Bots provide an experience that feels less like using a computer and more like dealing with a person – or at least an intelligent robot.
Quickstart – Create a bot with Azure Bot Service
Azure Cosmos DB
Calculating Cosmos DB Request Units (RU) for CRUD and Queries
Choose the appropriate API for Azure Cosmos DB
Different types of Databases
Document store – document-oriented database systems are characterized by their schema-free organization of data.
MongoDB is a document database, which means it stores data in JSON-like documents. We believe this is the most natural way to think about data, and is much more expressive and powerful than the traditional row/column model.
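A minimal sketch of what "JSON-like documents" and "schema-free" can look like in practice, using only Python's `json` module rather than a MongoDB driver. The collection contents, field names, and the `find` helper are hypothetical.

```python
import json

# Two documents in the same hypothetical collection with different shapes:
# no shared schema is required.
customers = [
    {"_id": 1, "name": "Ada", "emails": ["ada@example.com"]},
    {"_id": 2, "name": "Lin", "address": {"city": "Oslo", "zip": "0150"}},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    # Nested fields travel with the document instead of living in other tables.
    print(json.dumps(find(customers, name="Lin"), indent=2))
```

In a row/column model the nested `address` and the `emails` list would typically be split into separate tables and joined back at query time.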
Graph DBMS – represent data in graph structures as nodes and edges, which are relationships between nodes. They allow easy processing of data in that form, and simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node.
The Benefits of Graph Computing (Apache Tinkerpop)
Example : Gremlin
Gremlin is the graph traversal language of Apache TinkerPop. Gremlin is a functional, data-flow language that enables users to succinctly express complex traversals on (or queries of) their application’s property graph. It is developed by Apache TinkerPop of the Apache Software Foundation.
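The "number of steps needed to get from one node to another node" can be illustrated with a breadth-first search over an adjacency list. This is plain Python, not Gremlin, and the graph data and the `steps_between` function are invented for the example.

```python
from collections import deque

# Hypothetical undirected graph: each node maps to its neighbouring nodes.
GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def steps_between(graph, start, goal):
    """Return the fewest edges from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour == goal:
                return distance + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


if __name__ == "__main__":
    print(steps_between(GRAPH, "alice", "erin"))
```

A graph DBMS keeps the edges as first-class data, so a traversal like this does not have to be assembled from joins.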
Key-value store – the simplest form of database management system. It can only store pairs of keys and values, and retrieve a value when its key is known.
These simple systems are normally not adequate for complex applications. On the other hand, it is exactly this simplicity that makes such systems attractive in certain circumstances. For example, resource-efficient key-value stores are often used in embedded systems or as high-performance in-process databases.
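To make the restriction concrete, here is a minimal in-process key-value store sketch. The `KeyValueStore` class is hypothetical and wraps a Python dictionary; note that the only way to reach a value is through its key.

```python
class KeyValueStore:
    """In-process store: values can only be written and read back by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store or overwrite the value held under key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value for key, or default when the key is unknown."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove key if present; deleting a missing key is a no-op."""
        self._data.pop(key, None)


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("session:42", {"user": "ada"})
    print(store.get("session:42"))
```

There is deliberately no way to ask "which sessions belong to ada?" without scanning every key, which is the kind of query that makes such stores inadequate for complex applications.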
Wide column store (column-based) – stores data in records that can hold very large numbers of dynamic columns. Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide column stores can be seen as two-dimensional key-value stores.
Cassandra – open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
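The "two-dimensional key-value store" view of a wide column store can be sketched as a mapping from record key to a second mapping from column name to value. This is an illustration only, not how Cassandra is implemented or queried; the `WideColumnTable` class and the sample keys are hypothetical.

```python
from collections import defaultdict


class WideColumnTable:
    """Rows keyed by record key; each row holds its own dynamic set of columns."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell, creating the row and the column on demand."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Read one cell addressed by (row key, column name)."""
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """List the column names that exist for this particular row."""
        return sorted(self._rows.get(row_key, {}))


if __name__ == "__main__":
    table = WideColumnTable()
    table.put("sensor-1", "2020-01-01T00:00", 21.5)
    table.put("sensor-1", "2020-01-01T00:05", 21.7)
    table.put("sensor-2", "firmware", "1.4")
    print(table.columns("sensor-1"), table.columns("sensor-2"))
```

Each row ends up with a different set of columns, which is the property that separates this model from a fixed relational schema.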
Azure Cosmos DB – a fully managed database service with turnkey global distribution and transparent multi-master replication.
Azure Databricks
Azure Databricks is the latest Azure offering for data engineering and data science. Databricks’ greatest strengths are its zero-management cloud solution and the collaborative, interactive environment it provides in the form of notebooks.
Databricks is powered by Apache Spark and offers an API layer that lets you work with your data in a range of analytics languages: R, SQL, Python, Scala, and Java. The Spark ecosystem also offers additional components such as Streaming, MLlib, and GraphX.
Data can be gathered from a variety of sources, such as Blob Storage, ADLS, and from ODBC databases using Sqoop.
Tutorial: Extract, transform, and load data by using Azure Databricks
These notebooks show how to convert JSON data to Delta Lake format, create a Delta table, append to the table, optimize the resulting table, and finally use Delta Lake metadata commands to show the table history, format, and details.
You can manage notebooks using the UI, the CLI, and by invoking the Workspace API. This topic focuses on performing notebook tasks using the UI. For the other methods, see Databricks CLI and Workspace API.
Connecting to SQL Databases using JDBC
Read and Write Apache Parquet file in Spark
How to view Apache Parquet file in Windows?
Azure Data Catalog
Azure Data Catalog documentation
Azure Data Factory
Introduction to Azure Data Factory
Create Azure Data Factory from Cloudshell
Create and configure a self-hosted integration runtime
Azure Data Lake
Azure Data Lake is an on-demand, scalable, cloud-based storage and analytics service. It can be divided into two connected services, Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). ADLS is a cloud-based file system that allows the storage of any type of data with any structure, making it ideal for the analysis and processing of unstructured data.
Azure Data Lake Analytics
Azure Data Lake Analytics is a distributed, parallel job platform that executes U-SQL scripts in the cloud. The syntax is based on SQL with a twist of C#, a general-purpose programming language introduced by Microsoft in 2000.
Azure Data Lake Storage
Azure Data Lake Storage Gen1 (previously known as Azure Data Lake Store) is an enterprise-wide hyper-scale repository for big data analytics workloads. Data Lake Storage Gen1 lets you capture data of any size, type, and ingestion speed. The data is captured in a single place for operational and exploratory analytics.
Data Lake Storage Gen2 is the result of converging the capabilities of Microsoft's two existing storage services, Azure Blob storage and Azure Data Lake Storage Gen1.
Quickstart: Analyze data in Azure Data Lake Storage Gen2 by using Azure Databricks
Azure HDInsight is a cloud service that allows cost-effective data processing using open-source frameworks such as Hadoop, Spark, Hive, Storm, and Kafka, among others.
Using Apache Sqoop, we can import and export data to and from a multitude of sources, but the native file system that HDInsight uses is either Azure Data Lake Store or Azure Blob Storage.
Cluster types in HDI
|Cluster type|Description|
|---|---|
|Apache Hadoop|A framework that uses HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel.|
|Apache Spark|An open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications. See What is Apache Spark in HDInsight?|
|Apache HBase|A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data, potentially billions of rows times millions of columns. See What is HBase on HDInsight?|
|ML Services|A server for hosting and managing parallel, distributed R processes. It provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. See Overview of ML Services on HDInsight.|
|Apache Storm|A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.|
|Apache Interactive Query|In-memory caching for interactive and faster Hive queries. See Use Interactive Query in HDInsight.|
|Apache Kafka|An open-source platform that’s used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. See Introduction to Apache Kafka on HDInsight.|
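The MapReduce programming model named in the Apache Hadoop row can be sketched as a word count in plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop code; all function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big compute"]))
```

Here the phases run one after another in a single process. In a Hadoop cluster the map calls run in parallel on the nodes that hold the input blocks, and the reduce calls run in parallel per key group.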
Azure SQL Data Warehouse
What is Azure SQL Data Warehouse?
SQL Data Warehouse Documentation
Quickstart: Create and query an Azure SQL Data Warehouse in the Azure portal
Azure Stream Analytics
What is Azure Stream Analytics?
Azure Stream Analytics documentation
Azure Event Hub Stream Analytics and Power BI demo (Lab 6 concepts)
Comparison of Databricks vs HDInsight vs Data Lake Analytics
Cloud Analytics on Azure: Databricks vs HDInsight vs Data Lake Analytics
Updated DP-200 Labs
Demo video useful for Lab6b
Azure Event Hub Stream Analytics and Power BI Demo
Extract Lab6B into E:\Allfiles\Labfiles\Starter\DP-200.6 folder.
Extract Lab7 into E:\Allfiles\Instructions folder.
Lambda Architecture implementation using Microsoft Azure
Azure Cosmos DB: Implement a lambda architecture on the Azure platform
Databricks Lambda Architecture
Helpful Azure learning links
Microsoft Azure Forums: The Azure forums are very active. You can search the threads for a specific area of interest. You can also browse categories like Azure Storage, Pricing and Billing, Azure Virtual Machines, and Azure Migrate.
Azure Architecture Center: Gain access to the Azure Application Architecture Guide, Azure Reference Architectures, and the Cloud Design Patterns.
Microsoft Learning Community Blog: Get the latest information on the certification tests and exam study groups.
https://channel9.msdn.com/ Channel 9 provides a wealth of informational videos, shows, and
Azure Tuesdays With Corey: Corey Sanders answers your questions about Microsoft Azure – Virtual Machines, Web Sites, Mobile Services, Dev/Test, etc.
Azure Fridays: Join Scott Hanselman as he engages one-on-one with the engineers who build the services that power Microsoft Azure as they demo capabilities, answer Scott’s questions, and share their insights.
Microsoft Azure Blog: Keep current on what’s happening in Azure, including what’s now in preview, generally available, news & updates, and more.