This David Papkin page contains links for the Microsoft Azure DP-200 course.
Use Azure Batch to run large-scale parallel and high-performance computing (HPC) batch jobs efficiently in Azure. Azure Batch creates and manages a pool of compute nodes (virtual machines), installs the applications you want to run, and schedules jobs to run on the nodes.
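As a rough local analogy only (this is not the Azure Batch SDK), the idea of a pool of workers that a scheduler feeds with tasks can be sketched with Python's standard library. The names `run_task` and `run_job`, and the workload itself, are hypothetical and chosen just for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id):
    """Hypothetical unit of work: sum of squares below task_id * 1000."""
    n = task_id * 1000
    return task_id, sum(i * i for i in range(n))


def run_job(task_ids, pool_size=4):
    """Run every task on a fixed-size pool of workers; return results keyed by task id."""
    # The executor plays the role of the pool; map() hands tasks to idle workers.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return dict(pool.map(run_task, task_ids))


results = run_job(range(1, 9))
print(sorted(results))  # task ids that completed
```

In a real Batch job the workers would be virtual machines rather than threads, and the task would be an installed application rather than a Python function.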
Bots provide an experience that feels less like using a computer and more like dealing with a person – or at least an intelligent robot.
Azure Cosmos DB
Different types of Databases
Document store – document-oriented database systems are characterized by their schema-free organization of data.
Example: MongoDB is a document database, which means it stores data in JSON-like documents. MongoDB describes this as the most natural way to think about data, and as much more expressive and powerful than the traditional row/column model.
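To make "schema-free" concrete, here is a minimal in-memory sketch of a document collection. It is not MongoDB's API; the class `DocumentCollection`, its methods, and the sample documents are assumptions made for illustration.

```python
import json


class DocumentCollection:
    """Toy document store: each document is a JSON-like dict with its own shape."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, doc):
        doc_id = self._next_id
        self._next_id += 1
        # Round-trip through JSON to store a copy and reject non-JSON values.
        self._docs[doc_id] = json.loads(json.dumps(doc))
        return doc_id

    def find(self, **criteria):
        """Return documents whose top-level fields equal all given criteria."""
        return [
            doc for doc in self._docs.values()
            if all(doc.get(key) == value for key, value in criteria.items())
        ]


courses = DocumentCollection()
# The two documents have different fields; no schema has to be declared first.
courses.insert({"title": "DP-200", "vendor": "Microsoft", "labs": ["Lab6b", "Lab7"]})
courses.insert({"title": "Spark basics", "vendor": "Databricks", "languages": {"primary": "Scala"}})
matches = courses.find(vendor="Microsoft")
```

Note that the second document carries a nested object and no `labs` field at all, which a fixed row/column table could not hold without a schema change.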
Graph DBMS – represent data in graph structures as nodes and edges, which are relationships between nodes. They allow easy processing of data in that form, and simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node.
Example: Gremlin
Gremlin is the graph traversal language of Apache TinkerPop, a project of the Apache Software Foundation. Gremlin is a functional, data-flow language that enables users to succinctly express complex traversals on (or queries of) their application’s property graph.
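The "number of steps needed to get from one node to another" mentioned above can be computed with a breadth-first search. The sketch below is plain Python rather than Gremlin, treats edges as undirected, and uses a hypothetical function name (`steps_between`) and made-up sample data.

```python
from collections import deque


def steps_between(edges, start, goal):
    """Return the fewest edges between start and goal, or None if unreachable."""
    # Build an adjacency map; each edge is treated as undirected (an assumption).
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt == goal:
                return distance + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None


edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "erin")]
print(steps_between(edges, "alice", "dave"))  # 3
```

A graph DBMS performs this kind of traversal natively over stored nodes and edges, so the application does not need to load the whole graph into memory first.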
Key-value stores are simple systems that are normally not adequate for complex applications. On the other hand, it is exactly this simplicity that makes such systems attractive in certain circumstances. For example, resource-efficient key-value stores are often used in embedded systems or as high-performance in-process databases.
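The simplicity of a key-value store is easiest to see in code: the whole interface is put, get, and delete by key. The class below is a hypothetical in-process sketch backed by a dict, not any particular product.

```python
class KeyValueStore:
    """Toy in-process key-value store: values are opaque and looked up by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("session:42", b"user=david")
value = store.get("session:42")
```

There is no query language and no way to search by value, which is exactly the limitation, and the appeal, described above.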
Wide column store (column base) – store data in records with an ability to hold very large numbers of dynamic columns. Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide column stores can be seen as two-dimensional key-value stores.
Cassandra – open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
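The "two-dimensional key-value store" view of a wide column store can be sketched as a map from row key to a map from column name to value. This is an illustration only, not Cassandra's data model or API; `WideColumnTable` and the sample rows are hypothetical.

```python
class WideColumnTable:
    """Toy wide column table: row key -> {column name -> value}, columns not fixed."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        # Any row may gain any column at any time; nothing is declared up front.
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """Return the column names present in one row, sorted."""
        return sorted(self._rows.get(row_key, {}))


readings = WideColumnTable()
readings.put("sensor-1", "2020-01-01T00:00", 21.5)
readings.put("sensor-1", "2020-01-01T00:05", 21.7)
readings.put("sensor-2", "firmware", "1.0.3")
```

Here `sensor-1` and `sensor-2` hold entirely different columns, and a lookup always needs both coordinates: the row key and the column name.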
Azure CosmosDB – Azure Cosmos DB is a fully-managed database service with turnkey global distribution and transparent multi-master replication.
Azure Databricks
Azure Databricks is the latest Azure offering for data engineering and data science. Databricks’ greatest strengths are its zero-management cloud solution and the collaborative, interactive environment it provides in the form of notebooks.
Databricks is powered by Apache Spark and offers an API layer that lets you work with your data in a range of analytics languages: R, SQL, Python, Scala, and Java. The Spark ecosystem also offers a variety of components such as Streaming, MLlib, and GraphX.
Data can be gathered from a variety of sources, such as Blob Storage, ADLS, and from ODBC databases using Sqoop.
These notebooks show how to convert JSON data to Delta Lake format, create a Delta table, append to the table, optimize the resulting table, and finally use Delta Lake metadata commands to show the table history, format, and details.
You can manage notebooks using the UI, the CLI, and by invoking the Workspace API. This topic focuses on performing notebook tasks using the UI. For the other methods, see Databricks CLI and Workspace API.
Azure Data Catalog
Azure Data Factory
Azure Data Lake
Azure Data Lake is an on-demand, scalable, cloud-based storage and analytics service. It can be divided into two connected services, Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). ADLS is a cloud-based file system that allows the storage of any type of data with any structure, making it ideal for the analysis and processing of unstructured data.
Azure Data Lake Analytics
Azure Data Lake Analytics is a distributed, parallel job platform that allows the execution of U-SQL scripts in the cloud. The syntax is based on SQL with a twist of C#, a general-purpose programming language first released by Microsoft in 2001.
Azure Data Lake Storage
Azure Data Lake Storage Gen1 (previously known as Azure Data Lake Store) is an enterprise-wide hyper-scale repository for big data analytics workloads. Data Lake Storage Gen1 lets you capture data of any size, type, and ingestion speed. The data is captured in a single place for operational and exploratory analytics.
Data Lake Storage Gen2 is the result of converging the capabilities of Microsoft's two existing storage services, Azure Blob storage and Azure Data Lake Storage Gen1.
Azure HDInsight is a cloud service that allows cost-effective data processing using open-source frameworks such as Hadoop, Spark, Hive, Storm, and Kafka, among others.
Using Apache Sqoop, we can import and export data to and from a multitude of sources, but the native file system that HDInsight uses is either Azure Data Lake Store or Azure Blob Storage.
Cluster types in HDI
|Cluster type|Description|
|Apache Hadoop|A framework that uses HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel.|
|Apache Spark|An open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications. See What is Apache Spark in HDInsight?|
|Apache HBase|A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data–potentially billions of rows times millions of columns. See What is HBase on HDInsight?|
|ML Services|A server for hosting and managing parallel, distributed R processes. It provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. See Overview of ML Services on HDInsight.|
|Apache Storm|A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.|
|Apache Interactive Query|In-memory caching for interactive and faster Hive queries. See Use Interactive Query in HDInsight.|
|Apache Kafka|An open-source platform that’s used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams. See Introduction to Apache Kafka on HDInsight.|
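The "simple MapReduce programming model" named in the Apache Hadoop row can be illustrated with the classic word count, written here as a single-process Python sketch. It does not use Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and on a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: collapse all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count(["Azure Batch", "azure HDInsight azure"])
print(counts["azure"])  # 3
```

Because each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, the work splits cleanly across machines, which is what makes the model suitable for batch data.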
Azure SQL Data Warehouse
Azure Stream Analytics
Comparison of Databricks vs HDInsight vs Data Lake Analytics
Updated DP-200 Labs
Demo video useful for Lab6b
Extract Lab6B into E:\Allfiles\Labfiles\Starter\DP-200.6 folder.
Extract Lab7 into E:\Allfiles\Instructions folder.
End of David Papkin page containing links on Microsoft Azure DP-200 course.
Helpful Azure learning links
Microsoft Azure Forums: The Azure forums are very active. You can search the threads for a specific area of interest. You can also browse categories like Azure Storage, Pricing and Billing, Azure Virtual Machines, and Azure Migrate.
Azure Architecture Center: Gain access to the Azure Application Architecture Guide, Azure Reference Architectures, and the Cloud Design Patterns.
Microsoft Learning Community Blog: Get the latest information about the certification tests and exam study groups.
https://channel9.msdn.com/ Channel 9 provides a wealth of informational videos, shows, and
Azure Tuesdays With Corey: Corey Sanders answers your questions about Microsoft Azure – Virtual Machines, Web Sites, Mobile Services, Dev/Test, etc.
Azure Fridays: Join Scott Hanselman as he engages one-on-one with the engineers who build the services that power Microsoft Azure as they demo capabilities, answer Scott’s questions, and share their insights.
Microsoft Azure Blog: Keep current on what’s happening in Azure, including what’s now in preview, generally available, news & updates, and more.
End of David Papkin Microsoft Azure page.