Databricks Databricks-Certified-Professional-Data-Engineer today updated questions

Databricks Certified Data Engineer Professional Exam Questions and Answers

Question 1

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Options:

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.

Answer:

Explanation:

For this scenario where a one-TB JSON dataset needs to be converted into Parquet format without employing Delta Lake's auto-sizing features, the goal is to avoid unnecessary data shuffles and yet ensure optimal file sizes for the output Parquet files. Here’s a breakdown of why option A is most suitable:

Setting maxPartitionBytes: The spark.sql.files.maxPartitionBytes configuration sets the maximum number of bytes packed into a single partition when Spark reads files (in this case, the JSON files). Because narrow transformations preserve the partitioning, and each partition is written as one part-file when no repartition or coalesce is applied, the setting also influences the output file size. Setting this parameter to 512 MB therefore addresses the part-file size requirement without a shuffle.

Data Ingestion and Processing:

Ingesting Data: Load the JSON dataset into a DataFrame.

Applying Transformations: Perform any required narrow transformations that do not involve shuffling data (like filtering or adding new columns).

Writing to Parquet: Directly write the transformed DataFrame to Parquet files. The setting for maxPartitionBytes ensures that each part-file is approximately 512 MB, meeting the requirement for part-file size without additional steps to repartition or coalesce the data.
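The three steps above can be sketched in PySpark as follows. This is a minimal sketch under stated assumptions, not a reference solution: the names `max_partition_bytes`, `json_to_parquet`, and `narrow_transform` are hypothetical, the paths are supplied by the caller, and an existing SparkSession is assumed to be passed in as `spark`.

```python
MB = 1024 * 1024


def max_partition_bytes(target_mb: int) -> int:
    """Convert a target size in MB to the byte value used for the setting."""
    return target_mb * MB


def json_to_parquet(spark, source_path, target_path, target_mb=512,
                    narrow_transform=lambda df: df):
    """Hypothetical helper: read JSON, apply narrow transformations, write Parquet.

    `spark` is an existing SparkSession. The helper itself never calls
    repartition or coalesce, so it introduces no shuffle of its own.
    """
    # Cap the size of each input partition at the target part-file size.
    spark.conf.set("spark.sql.files.maxPartitionBytes",
                   str(max_partition_bytes(target_mb)))
    df = spark.read.json(source_path)   # step 1: ingest
    df = narrow_transform(df)           # step 2: e.g. filter or add columns
    df.write.parquet(target_path)       # step 3: write one file per partition
```

Output sizes are approximate, since Parquet compresses and encodes the data it writes.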

Performance Consideration: This approach is optimal because:

It avoids the overhead of shuffling data, which can be significant, especially with large datasets.

It directly ties the read/write operations to a configuration that matches the target output size, making it efficient in terms of both computation and I/O operations.

Alternative Options Analysis:

Options B and D: Both involve repartitioning (B through the sort, D through an explicit repartition), which triggers a shuffle of the data and contradicts the requirement to avoid shuffling.

Option C: Uses coalesce, which is less intensive than repartition but can still lead to uneven partition sizes and does not directly control the output file size as effectively as setting maxPartitionBytes.

Option E: Setting shuffle partitions to 512 doesn’t directly control the output file size for writing to Parquet and could lead to smaller files depending on the dataset's partitioning post-transformations.
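The partition count quoted in options B, C, and D can be checked with a few lines of arithmetic. This sketch assumes binary units (1 TB = 1024 × 1024 MB), as the options do; the function name is illustrative only.

```python
def partitions_needed(total_tb: int, target_mb: int) -> int:
    """Number of target-sized parts needed to hold the dataset (binary units)."""
    total_mb = total_tb * 1024 * 1024  # TB -> GB -> MB
    # Ceiling division so a partial final part still gets its own partition.
    return -(-total_mb // target_mb)


print(partitions_needed(1, 512))  # 2048
```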

Question 2

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?

Options:

date = spark.conf.get("date")

input_dict = input()

date = input_dict["date"]

import sys

date = sys.argv[1]

date = dbutils.notebooks.getParam("date")

dbutils.widgets.text("date", "null")

date = dbutils.widgets.get("date")

Question 3

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream for highly selective joins on a number of fields, and will also be leveraged by the machine learning team to filter on a handful of relevant fields. In total, 15 fields have been identified that will often be used for filter and join logic.

The data engineer is trying to determine the best approach for dealing with these nested fields before declaring the table schema.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

Because Delta Lake uses Parquet for data storage, Dremel encoding information for nesting can be directly referenced by the Delta transaction log.

Tungsten encoding used by Databricks is optimized for storing string data: newly-added native support for querying JSON strings means that string types are always most efficient.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

By default Delta Lake collects statistics on the first 32 columns in a table; these statistics are leveraged for data skipping when executing selective queries.

Question 4

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.

If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

Options:

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.

All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.

Question 5

An analytics team wants to run a short-term experiment in Databricks SQL on the customer transactions Delta table (about 20 billion records) created by the data engineering team. Which strategy should the data engineering team use to ensure minimal downtime and no impact on the ongoing ETL processes?

Options:

Create a new table for the analytics team using a CTAS statement.

Deep clone the table for the analytics team.

Give the analytics team direct access to the production table.

Shallow clone the table for the analytics team.

Question 6

The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table.

Assuming that user_id is a unique identifying key and that all users who have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible, and why?

Options:

No: files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.

Yes: Delta Lake ACID guarantees provide assurance that the DELETE command succeeded fully and permanently purged these records.

No: the change data feed only tracks inserts and updates not deleted records.

No: the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command.

Question 7

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

importlib.resource_path

sys.path

os.path

pypi.path

pylib.source

Question 8

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:

(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

Options:

The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.

The orders table will contain only the most recent 2 hours of records and no duplicates will be present.

All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.

Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.

Question 9

The data architect has decided that once data has been ingested from external sources into the Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views.

The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.

GRANT USAGE ON DATABASE prod TO eng;

GRANT SELECT ON DATABASE prod TO eng;

Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?

Options:

Group members have full permissions on the prod database and can also assign permissions to other users or groups.

Group members are able to list all tables in the prod database but are not able to see the results of any queries on those tables.

Group members are able to query and modify all tables and views in the prod database, but cannot create new tables or views.

Group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.

Group members are able to create, query, and modify all tables and views in the prod database, but cannot define custom functions.

Question 10

A data engineer wants to run unit tests using common Python testing frameworks on Python functions defined across several Databricks notebooks currently used in production.

How can the data engineer run unit tests against functions that work with data in production?

Options:

Run unit tests against non-production data that closely mirrors production

Define and unit test functions using Files in Repos

Define unit tests and functions within the same notebook

Define and import unit test functions from a separate Databricks notebook

Question 11

Which statement regarding stream-static joins and static Delta tables is correct?

Options:

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.

The checkpoint directory will be used to track state information for the unique keys present in the join.

Stream-static joins cannot use static Delta tables because of consistency issues.

The checkpoint directory will be used to track updates to the static Delta table.

Question 12

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating the job run request was submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?

Options:

The job_id and number of times the job has been run are concatenated and returned.

The globally unique ID of the newly triggered run.

The job_id is returned in this field.

The number of times the job definition has been run in this workspace.

Question 13

Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?

Options:

spark.sql.files.maxPartitionBytes

spark.sql.autoBroadcastJoinThreshold

spark.sql.files.openCostInBytes

spark.sql.adaptive.coalescePartitions.minPartitionNum

spark.sql.adaptive.advisoryPartitionSizeInBytes

Question 14

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

Options:

Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.

Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.

Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.

Question 15

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:

key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG

There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to retain records containing PII in this table for only 14 days after initial ingestion. However, it would like to retain non-PII records indefinitely.

Which of the following solutions meets the requirements?

Options:

All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.

Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.

Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken.

Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.

Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.

Question 16

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.

Which describes how Delta Lake can help to avoid data loss of this nature in the future?

Options:

The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.

Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.

Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.

Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.

Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.

Question 17

A Delta table of weather records is partitioned by date and has the below schema:

date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the below filter:

latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?

Options:

All records are cached to an operational database and then the filter is applied

The Parquet file footers are scanned for min and max statistics for the latitude column

All records are cached to attached storage and then the filter is applied

The Delta log is scanned for min and max statistics for the latitude column

The Hive metastore is scanned for min and max statistics for the latitude column

Question 18

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

Which statement describes this implementation?

Options:

The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

Question 19

A Delta Lake table with Change Data Feed (CDF) enabled in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources. The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours. Which approach would simplify the identification of these changed records?

Options:

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

Modify the overwrite logic to include a field populated by calling current_timestamp() as data are being written; use this field to identify records written on a particular date.

Replace the current overwrite logic with a MERGE statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the Change Data Feed.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

Question 20

A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

Which kind of test does the above line exemplify?

Options:

Integration

Unit

Manual

Functional

Question 21

The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.

The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.

Which statement exemplifies best practices for implementing this system?

Options:

Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.

Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.

Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.

Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.

Because all tables must live in the same storage containers used for the database they're created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.

Question 22

A data engineer has created a new cluster using shared access mode with default configurations. The data engineer needs to allow the development team access to view the driver logs if needed.

What are the minimal cluster permissions that allow the development team to accomplish this?

Options:

CAN ATTACH TO

CAN MANAGE

CAN VIEW

CAN RESTART

Question 23

The data engineer is using Spark's MEMORY_ONLY storage level.

Which indicator should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?

Options:

Size on Disk is > 0

The number of Cached Partitions > the number of Spark Partitions

The RDD Block Name includes the '' annotation signaling failure to cache

On Heap Memory Usage is within 75% of Off Heap Memory Usage

Question 24

A data engineer needs to capture pipeline settings from an existing pipeline in the workspace, and use them to create and version a JSON file to create a new pipeline.

Which command should the data engineer enter in a web terminal configured with the Databricks CLI?

Options:

Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and rename the pipeline; use this in a create command

Stop the existing pipeline; use the returned settings in a reset command

Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git

Use list pipelines to get the specs for all pipelines; get the pipeline spec from the returned results, parse it, and use this to create a pipeline

Question 25

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?

Options:

In the Executor's log file, by grepping for "predicate push-down"

In the Stage's Detail screen, in the Completed Stages table, by noting the size of data read from the Input column

In the Storage Detail screen, by noting which RDDs are not stored on disk

In the Delta Lake transaction log, by noting the column statistics

In the Query Detail screen, by interpreting the Physical Plan

Question 26

A data engineer has created a transactions Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool that requires Apache Iceberg format.

What should the data engineer do?

Options:

Require the analytics team to use a tool that supports Delta tables.

Enable UniForm on the transactions table with 'iceberg' so that the table can be read as an Iceberg table.

Create an Iceberg copy of the transactions Delta table which can be used by the analytics team.

Convert the transactions Delta table to Iceberg and enable uniform so that the table can be read as a Delta table.

Question 27

A data engineer wants to refactor the following DLT code, which includes multiple table definitions with very similar code:

In an attempt to programmatically create these tables using a parameterized table definition, the data engineer writes the following code.

The pipeline runs an update with this refactored code, but generates a different DAG showing incorrect configuration values for tables.

How can the data engineer fix this?

Options:

Convert the list of configuration values to a dictionary of table settings, using table names as keys.

Convert the list of configuration values to a dictionary of table settings, using a different input to the for loop.

Load the configuration values for these tables from a separate file, located at a path provided by a pipeline parameter.

Wrap the loop inside another table definition, using generalized names and properties to replace with those from the inner table.

Question 28

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

Options:

%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.

Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.

%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.

%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

Question 29

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT(*) FROM table

Which of the following describes how results are generated each time the dashboard is updated?

Options:

The total count of rows is calculated by scanning all data files

The total count of rows will be returned from cached results unless REFRESH is run

The total count of records is calculated from the Delta transaction logs

The total count of records is calculated from the parquet file metadata

The total count of records is calculated from the Hive metastore

Question 30

Which of the following is true of Delta Lake and the Lakehouse?

Options:

Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.

Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

Z-order can only be applied to numeric values stored in Delta Lake tables

Answer:

Explanation:

https://docs.delta.io/2.0.0/table-properties.html

Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters [1]. Data skipping is a performance optimization that avoids reading irrelevant data from the storage layer [1]. By collecting per-file statistics such as min/max values and null counts, Delta Lake can prune files that cannot match the filter from the query plan [1]. This can significantly improve query performance and reduce I/O cost.
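To make the pruning step concrete, here is a toy model of min/max data skipping in plain Python. It is a sketch of the idea only, not Delta Lake's implementation: the `FileStats` class, the file names, and the statistics values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file min/max statistics for one column (hypothetical values)."""
    path: str
    min_value: float
    max_value: float


def files_to_scan(stats, lower_bound):
    """Keep only files that could contain rows with value > lower_bound."""
    # A file can be skipped when even its maximum fails the predicate.
    return [s.path for s in stats if s.max_value > lower_bound]


stats = [
    FileStats("part-0.parquet", -10.0, 20.5),
    FileStats("part-1.parquet", 30.0, 70.2),
    FileStats("part-2.parquet", 50.0, 50.0),
]
print(files_to_scan(stats, 50.0))  # ['part-1.parquet']
```

Only one of the three files has to be opened; the other two are ruled out from their statistics alone.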

The other options are false because:

Parquet compresses data column by column, not row by row [2]. This allows for better compression ratios, especially for repeated or similar values within a column [2].

Views in the Lakehouse do not maintain a valid cache of the most recent versions of source tables at all times [3]. Views are logical constructs defined by a SQL query on one or more base tables [3]. They are not materialized by default: they store only the query definition, not data [3]. Views therefore always reflect the latest state of the source tables when queried [3]. Their results can, however, be cached explicitly with CACHE TABLE or materialized into a table with CREATE TABLE AS SELECT.

Primary and foreign key constraints cannot be leveraged to ensure duplicate values are never entered into a dimension table. Delta Lake does not enforce primary and foreign key constraints; on Databricks they are informational only. Constraints are logical rules that define the integrity and validity of the data in a table, so here it is up to the application logic or the user to ensure data quality and consistency.

Z-order can be applied to any values stored in Delta Lake tables, not only numeric values. Z-order is a technique to optimize the layout of the data files by sorting them on one or more columns. Z-order can improve the query performance by clustering related values together and enabling more efficient data skipping. Z-order can be applied to any column that has a defined ordering, such as numeric, string, date, or boolean values.
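The earlier point about column-by-column compression can be illustrated with a toy run-length encoder. This is only an illustration under simplifying assumptions: the sample rows are hypothetical, and Parquet's actual encodings (dictionary and run-length encoding among them) are more involved than this sketch.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical table of (country, status) rows.
rows = [("US", "ok"), ("US", "ok"), ("US", "ok"), ("DE", "ok")]

# Column layout: each column's values are stored together, so runs appear.
country_column = [country for country, _ in rows]
print(run_length_encode(country_column))  # [('US', 3), ('DE', 1)]

# Row layout: values from different columns interleave and break up the runs.
row_layout = [value for row in rows for value in row]
print(len(run_length_encode(row_layout)))  # 8
```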

References: [1] Data Skipping, [2] Parquet Format, [3] Views, Caching, Constraints, Z-Ordering

Question 31

A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline was deployed, the data engineering team noticed some latency issues during certain times of the day.

A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays.

Which limitation will the team face while diagnosing this problem?

Options:

New fields cannot be computed for historic records.

Updating the table schema will invalidate the Delta transaction log metadata.

Updating the table schema requires a default value provided for each file added.

Spark cannot capture the topic and partition fields from the Kafka source.

Question 32

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database.

After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).

Which statement describes what will happen when the above code is executed?

Options:

The connection to the external table will fail; the string "redacted" will be printed.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.

The connection to the external table will succeed; the string value of password will be printed in plain text.

The connection to the external table will succeed; the string "redacted" will be printed.

Question 33

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

An analyst who is not a member of the auditing group executes the following query:

Which result will be returned by this query?

Options:

All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.

All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.

All age values less than 18 will be returned as null values; all other columns will be returned with the values in user_ltv.

All records from all columns will be displayed with the values in user_ltv.

Question 34

A data architect has heard about Delta Lake's built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table.

The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.

Which piece of information is critical to this decision?

Options:

Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution.

Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place.

Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning.

Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update.
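To make the two table designs in this question concrete, here is a small pure-Python sketch of how each handles an address change. It models the tables as in-memory structures only; the function names, column names (`start_date`, `end_date`, `is_current`), and sample data are hypothetical and not taken from the question.

```python
def apply_type1(current, customer_id, address):
    """Type 1: overwrite the existing value; the table keeps no history."""
    current[customer_id] = address


def apply_type2(history, customer_id, address, effective_date):
    """Type 2: close the current row and append a new one, keeping every version."""
    for row in history:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False
            row["end_date"] = effective_date
    history.append(
        {
            "customer_id": customer_id,
            "address": address,
            "start_date": effective_date,
            "end_date": None,
            "is_current": True,
        }
    )


type1_table = {}
type2_table = []
changes = [("2024-01-01", "1 Main St"), ("2024-06-01", "9 Oak Ave")]
for day, addr in changes:
    apply_type1(type1_table, 42, addr)
    apply_type2(type2_table, 42, addr, day)
```

After both changes, the Type 1 structure holds only the latest address, while the Type 2 structure holds one row per address with its validity window, so the history can be read from the table itself.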

Question 35

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

Options:

The Five Minute Load Average remains consistent/flat

Bytes Received never exceeds 80 million bytes per second

Total Disk Space remains constant

Network I/O never spikes

Overall cluster CPU utilization is around 25%
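The cluster in this question has 3 executors plus a driver on the same VM type, so four equally sized nodes. The arithmetic below shows what the cluster-wide CPU figure would look like if only one node were busy. It assumes, for illustration, that the overall figure is a simple average across nodes; `cluster_cpu_utilization` is a hypothetical helper.

```python
def cluster_cpu_utilization(node_utilizations):
    """Simple average of per-node CPU utilization (each value in 0.0..1.0)."""
    return sum(node_utilizations) / len(node_utilizations)


# Driver fully busy, three executors idle.
driver_only = cluster_cpu_utilization([1.0, 0.0, 0.0, 0.0])

# All four nodes fully busy, for comparison.
all_busy = cluster_cpu_utilization([1.0, 1.0, 1.0, 1.0])
```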

Question 36

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.

Which consideration will impact the decisions made by the engineer while migrating this workload?

Options:

All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.

Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.

Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's upsert functionality.

Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.

Question 37

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.

Which statement describes the contents of the workspace audit logs concerning these events?

Options:

Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.

Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.

Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.

Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.

Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.

Question 38

What statement is true regarding the retention of job run history?

Options:

It is retained until you export or delete job run logs

It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3

It is retained for 60 days, during which you can export notebook run results to HTML

It is retained for 60 days, after which logs are archived

It is retained for 90 days or until the run-id is re-used through custom run configuration

Question 39

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Which command should be removed from the notebook before scheduling it as a job?

Options:

Cmd 2

Cmd 3

Cmd 4

Cmd 5

Cmd 6

Question 40

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

Options:

userLookup.join(streamingDF, ["userid"], how="inner")

streamingDF.join(userLookup, ["user_id"], how="outer")

streamingDF.join(userLookup, ["user_id"], how="left")

streamingDF.join(userLookup, ["userid"], how="inner")

userLookup.join(streamingDF, ["user_id"], how="right")
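The rule behind this question can be sketched as a small lookup. The table below encodes the inner, left, right, and full outer cases as described in the join support matrix of the Structured Streaming programming guide; it is a plain-Python model with hypothetical names (`SUPPORTED_JOINS`, `is_supported_join`), not a Spark API, and it should be checked against the guide for the Spark version in use.

```python
# (left side, right side) -> join types that Structured Streaming accepts.
# Only inner/left/right/outer are modelled here.
SUPPORTED_JOINS = {
    ("stream", "static"): {"inner", "left"},
    ("static", "stream"): {"inner", "right"},
}


def is_supported_join(left_kind, right_kind, how):
    """Return True if a join between a stream and a static table is supported.

    The outer side of the join must be the streaming side; a full outer
    join would need to preserve static rows, so it is never in the table.
    """
    return how in SUPPORTED_JOINS.get((left_kind, right_kind), set())


checks = {
    "stream inner static": is_supported_join("stream", "static", "inner"),
    "stream left static": is_supported_join("stream", "static", "left"),
    "stream outer static": is_supported_join("stream", "static", "outer"),
    "static right stream": is_supported_join("static", "stream", "right"),
    "static left stream": is_supported_join("static", "stream", "left"),
}
```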
