Databricks Certified Associate Developer for Apache Spark 3.5 – Python (Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5) Exam Practice Test

Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Question 1

The following code fragment results in an error:

[code fragment not shown in this extract]

Which code fragment should be used instead?

Options:

A. [code fragment not shown]

B. [code fragment not shown]

C. [code fragment not shown]

D. [code fragment not shown]

Question 2

Given a DataFrame df that has 10 partitions, after running the code:

df.repartition(20)

How many partitions will the result DataFrame have?

Options:

A.

5

B.

20

C.

Same number as the cluster executors

D.

10
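
For reference, a small runnable sketch of this behavior, using a toy spark.range DataFrame in place of the exam's df:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).repartition(10)
print(df.rdd.getNumPartitions())    # 10

# repartition() is a transformation that returns a new DataFrame;
# the original keeps its partition count unless the result is reassigned.
df2 = df.repartition(20)
print(df2.rdd.getNumPartitions())   # 20
print(df.rdd.getNumPartitions())    # still 10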

Question 3

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove the rows that are duplicated across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate these transaction records using PySpark?

Options:

A.

df = df.dropDuplicates()

B.

df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))

C.

df = df.filter(F.col("transaction_id").isNotNull())

D.

df = df.dropDuplicates(["transaction_amount"])

Question 4

An organization has been running a Spark application in production and is considering disabling the Spark History Server to reduce resource usage.

What will be the impact of disabling the Spark History Server in production?

Options:

A.

Prevention of driver log accumulation during long-running jobs

B.

Improved job execution speed due to reduced logging overhead

C.

Loss of access to past job logs and reduced debugging capability for completed jobs

D.

Enhanced executor performance due to reduced log size

Question 5

What is the benefit of using Pandas API on Spark for data transformations?

Options:

A.

It executes queries faster using all the available cores in the cluster as well as provides Pandas's rich set of features.

B.

It is available only with Python, thereby reducing the learning curve.

C.

It runs on a single node only, utilizing memory efficiently.

D.

It computes results immediately using eager execution.
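
A short illustrative sketch of the pandas API on Spark (requires the pandas and pyarrow packages; the "doubled" column is made up for this example):

import pyspark.pandas as ps

# A pandas-like DataFrame whose operations are executed by Spark across the cluster.
psdf = ps.range(10)
psdf["doubled"] = psdf["id"] * 2
print(psdf.head())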

Question 6

A data scientist has been investigating user profile data to build features for their model. After some exploratory data analysis, the data scientist identified that some records in the user profiles contain NULL values in too many fields to be useful.

The schema of the user profile table looks like this:

user_id STRING,

username STRING,

date_of_birth DATE,

country STRING,

created_at TIMESTAMP

The data scientist decided that if any record contains a NULL value in any field, they want to remove that record from the output before further processing.

Which block of Spark code can be used to achieve these requirements?

Options:

A.

filtered_users = raw_users.na.drop("any")

B.

filtered_users = raw_users.na.drop("all")

C.

filtered_users = raw_users.dropna(how="any")

D.

filtered_users = raw_users.dropna(how="all")
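
For reference, a small sketch showing the difference between how="any" and how="all" on a toy version of the profile table:

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_users = spark.createDataFrame(
    [
        ("u1", "alice", datetime.date(1990, 1, 1), "US", datetime.datetime(2024, 1, 1, 12, 0)),
        ("u2", "bob", None, None, None),
    ],
    "user_id STRING, username STRING, date_of_birth DATE, country STRING, created_at TIMESTAMP",
)

# how="any": drop a row if ANY column is NULL (same as na.drop("any")).
# how="all": drop a row only if EVERY column is NULL.
raw_users.dropna(how="any").show()
raw_users.dropna(how="all").show()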

Question 7

Which configuration can be enabled to optimize the conversion between Pandas and PySpark DataFrames using Apache Arrow?

Options:

A.

spark.conf.set("spark.pandas.arrow.enabled", "true")

B.

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

C.

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

D.

spark.conf.set("spark.sql.arrow.pandas.enabled", "true")

Question 8

A Spark application is experiencing performance issues in client mode due to the driver being resource-constrained.

How should this issue be resolved?

Options:

A.

Switch the deployment mode to cluster mode.

B.

Add more executor instances to the cluster.

C.

Increase the driver memory on the client machine.

D.

Switch the deployment mode to local mode.

Question 9

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

region_id   region_name
10          North
12          East
14          West

The resulting Python dictionary must map region_id to region_name for the 3 smallest region_id values.

Which code fragment meets the requirements?

Options:

A.

regions_dict = dict(regions.take(3))

B.

regions_dict = regions.select("region_id", "region_name").take(3)

C.

regions_dict = dict(regions.select("region_id", "region_name").rdd.collect())

D.

regions_dict = dict(regions.orderBy("region_id").limit(3).rdd.map(lambda x: (x.region_id, x.region_name)).collect())
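
As a reference for the APIs involved, one way to build such a mapping from a toy DataFrame (createDataFrame stands in for reading the Parquet table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

regions = spark.createDataFrame(
    [(14, "West"), (10, "North"), (12, "East")], ["region_id", "region_name"]
)

# Order by region_id, keep the 3 smallest, then build (key, value) pairs on the driver.
regions_dict = dict(
    regions.orderBy("region_id")
    .limit(3)
    .rdd.map(lambda r: (r.region_id, r.region_name))
    .collect()
)
print(regions_dict)   # {10: 'North', 12: 'East', 14: 'West'}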

Question 10

A data engineer wants to write a Spark job that creates a new managed table. If the table already exists, the job should fail and not modify anything.

Which save mode and method should be used?

Options:

A.

saveAsTable with mode ErrorIfExists

B.

saveAsTable with mode Overwrite

C.

save with mode Ignore

D.

save with mode ErrorIfExists
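
For reference, save modes ("append", "overwrite", "ignore", "error"/"errorifexists") are passed via mode(); a minimal sketch with a hypothetical table name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10)

# "errorifexists" (alias "error") is the default: the write raises an
# AnalysisException if the target table already exists, leaving it unmodified.
df.write.mode("errorifexists").saveAsTable("demo_table")   # hypothetical table name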

Question 11

What is the behavior of the function date_sub(start, days) if a negative value is passed as the days parameter?

Options:

A.

The same start date will be returned

B.

An error message of an invalid parameter will be returned

C.

The number of days specified will be added to the start date

D.

The number of days specified will be removed from the start date
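
A quick runnable check of the function's behavior with a negative argument:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("2024-01-10",)], ["start"]).withColumn("start", F.to_date("start"))

# date_sub(start, -5) behaves like date_add(start, 5): the negative days are added.
df.select(
    F.date_sub("start", -5).alias("date_sub_minus_5"),
    F.date_add("start", 5).alias("date_add_5"),
).show()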

Question 12

A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user.

Before further processing, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns.

The PII columns in df_user are name, email, and birthdate.

Which code snippet can be used to meet this requirement?

Options:

A.

df_user_non_pii = df_user.drop("name", "email", "birthdate")

B.

df_user_non_pii = df_user.dropFields("name", "email", "birthdate")

C.

df_user_non_pii = df_user.select("name", "email", "birthdate")

D.

df_user_non_pii = df_user.remove("name", "email", "birthdate")
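
For reference, DataFrame.drop accepts multiple column names; a sketch with a toy df_user:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_user = spark.createDataFrame(
    [(1, "Alice", "a@example.com", "1990-01-01", "US")],
    ["user_id", "name", "email", "birthdate", "country"],
)

# Returns a new DataFrame without the listed (PII) columns; df_user itself is unchanged.
df_user_non_pii = df_user.drop("name", "email", "birthdate")
df_user_non_pii.show()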

Question 13

Which Spark configuration controls the number of tasks that can run in parallel on the executor?

Options:

A.

spark.executor.cores

B.

spark.task.maxFailures

C.

spark.driver.cores

D.

spark.executor.memory
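
For context, a sketch of how these settings are typically supplied; the values are illustrative, and they must be set before the application starts to take effect:

from pyspark.sql import SparkSession

# Roughly, tasks running in parallel per executor = spark.executor.cores / spark.task.cpus.
spark = (
    SparkSession.builder
    .config("spark.executor.cores", "4")
    .config("spark.task.cpus", "1")
    .getOrCreate()
)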

Question 14

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

Options:

A.

Execute their pyspark shell with the option --remote "https://localhost "

B.

Execute their pyspark shell with the option --remote "sc://localhost"

C.

Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell

D.

Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code

E.

Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code

Question 15

A developer runs:

[code snippet not shown in this extract]

What is the result?

Options:

A.

It stores all data in a single Parquet file.

B.

It throws an error if there are null values in either partition column.

C.

It appends new partitions to an existing Parquet file.

D.

It creates separate directories for each unique combination of color and fruit.
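
The original code snippet is not reproduced above. As general context for the options, a hedged sketch of a partitioned Parquet write; the column names color and fruit are taken from option D, and the data and output path are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("red", "apple", 3), ("red", "cherry", 5), ("green", "apple", 2)],
    ["color", "fruit", "qty"],
)

# partitionBy writes one directory per unique combination of the partition columns,
# e.g. .../color=red/fruit=apple/ under the output path.
df.write.mode("overwrite").partitionBy("color", "fruit").parquet("/tmp/fruit_parquet")   # hypothetical path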

Question 16

An application architect has been investigating Spark Connect as a way to modernize existing Spark applications running in their organization.

Which requirement blocks the adoption of Spark Connect in this organization?

Options:

A.

Debuggability: the ability to perform interactive debugging directly from the application code

B.

Upgradability: the ability to upgrade the Spark applications independently from the Spark driver itself

C.

Complete Spark API support: the ability to migrate all existing code to Spark Connect without modification, including the RDD APIs

D.

Stability: isolation of application code and dependencies from each other and the Spark driver

Question 17

Given:

spark.sparkContext.setLogLevel("")

Which set contains only valid values for the Spark driver LOG_LEVEL?

Options:

A.

ALL, DEBUG, FAIL, INFO

B.

ERROR, WARN, TRACE, OFF

C.

WARN, NONE, ERROR, FATAL

D.

FATAL, NONE, INFO, DEBUG
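
For reference, setLogLevel accepts the log4j level names; a one-line sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
spark.sparkContext.setLogLevel("WARN")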

Question 18

A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming.

They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.

Which code snippet fulfills this requirement?

Options:

A.

query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()

B.

query = df.writeStream \
    .outputMode("append") \
    .trigger(continuous="5 seconds") \
    .start()

C.

query = df.writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .start()

D.

query = df.writeStream \
    .outputMode("append") \
    .start()
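
For context, a runnable micro-batch example; the built-in rate source and a console sink are added here (they are not part of the question) so the snippet runs on its own:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

query = (
    df.writeStream
    .outputMode("append")
    .format("console")
    .trigger(processingTime="5 seconds")   # process micro-batches every 5 seconds
    .start()
)
# query.awaitTermination()   # uncomment to keep the stream running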

Question 19

A Data Analyst needs to retrieve employees with 5 or more years of tenure.

Which code snippet filters and shows the list?

Options:

A.

employees_df.filter(employees_df.tenure >= 5).show()

B.

employees_df.where(employees_df.tenure >= 5)

C.

filter(employees_df.tenure >= 5)

D.

employees_df.filter(employees_df.tenure >= 5).collect()

Question 20

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

Options:

A.

By configuring the option checkpointLocation during readStream

B.

By configuring the option recoveryLocation during the SparkSession initialization

C.

By configuring the option recoveryLocation during writeStream

D.

By configuring the option checkpointLocation during writeStream
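
A minimal sketch of a recoverable stream; the rate source and console sink are stand-ins, and the checkpoint path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.readStream.format("rate").load()

# The checkpoint directory stores offsets and state, letting a restarted
# query continue from where the previous run left off.
query = (
    df.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/demo")   # hypothetical path
    .start()
)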

Question 21

An engineer wants to join two DataFrames df1 and df2 on the respective employee_id and emp_id columns:

df1: employee_id INT, name STRING

df2: emp_id INT, department STRING

The engineer uses:

result = df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

What is the behavior of the code snippet?

Options:

A.

The code fails to execute because the column names employee_id and emp_id do not match automatically

B.

The code fails to execute because it must use on='employee_id' to specify the join column explicitly

C.

The code fails to execute because PySpark does not support joining DataFrames with a different structure

D.

The code works as expected because the join condition explicitly matches employee_id from df1 with emp_id from df2
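
For reference, a runnable sketch with toy rows; note that the joined result keeps both the employee_id and emp_id columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["employee_id", "name"])
df2 = spark.createDataFrame([(1, "Engineering")], ["emp_id", "department"])

# An explicit join condition lets the column names differ between the DataFrames.
result = df1.join(df2, df1.employee_id == df2.emp_id, how="inner")
result.show()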

Question 22

A data engineer wants to create an external table from a JSON file located at /data/input.json with the following requirements:

Create an external table named users

Automatically infer schema

Merge records with differing schemas

Which code snippet should the engineer use?

Options:

A.

CREATE TABLE users USING json OPTIONS (path '/data/input.json')

B.

CREATE EXTERNAL TABLE users USING json OPTIONS (path '/data/input.json')

C.

CREATE EXTERNAL TABLE users USING json OPTIONS (path '/data/input.json', mergeSchema 'true')

D.

CREATE EXTERNAL TABLE users USING json OPTIONS (path '/data/input.json', schemaMerge 'true')

Question 23

A Spark developer is developing a Spark application to monitor task performance across a cluster.

One requirement is to track the maximum processing time for tasks on each worker node and consolidate this information on the driver for further analysis.

Which technique should the developer use?

Options:

A.

Broadcast a variable to share the maximum time among workers.

B.

Configure the Spark UI to automatically collect maximum times.

C.

Use an RDD action like reduce() to compute the maximum time.

D.

Use an accumulator to record the maximum time on the driver.

Question 24

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns user_id, product_id, and purchase_amount and needs to perform some operations on this data efficiently.

Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

Options:

A.

df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")

B.

df.withColumn("discount", df.purchase_amount * 0.1).select("discount")

C.

df.withColumn("purchase_date", current_date()).where("total_purchase > 50")

D.

df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)

Question 25

What is the main advantage of partitioning the data when persisting tables?

Options:

A.

It compresses the data to save disk space.

B.

It automatically cleans up unused partitions to optimize storage.

C.

It ensures that data is loaded into memory all at once for faster query execution.

D.

It optimizes by reading only the relevant subset of data from fewer partitions.
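
As context, a sketch showing partition pruning on read; the column values and output path are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(100).withColumn("country", (F.col("id") % 3).cast("string"))
df.write.mode("overwrite").partitionBy("country").parquet("/tmp/by_country")   # hypothetical path

# A filter on the partition column lets Spark read only the matching
# directories; explain() shows this under PartitionFilters.
spark.read.parquet("/tmp/by_country").filter(F.col("country") == "1").explain()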

Question 26

Given the following code snippet in my_spark_app.py:

[code snippet not shown in this extract]

What is the role of the driver node?

Options:

A.

The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes

B.

The driver node only provides the user interface for monitoring the application

C.

The driver node holds the DataFrame data and performs all computations locally

D.

The driver node stores the final result after computations are completed by worker nodes

Question 27

Which Spark configuration controls the number of tasks that can run in parallel on an executor?

Options:

A.

spark.executor.cores

B.

spark.task.maxFailures

C.

spark.executor.memory

D.

spark.sql.shuffle.partitions

Question 28

A data engineer needs to persist a file-based data source to a specific location. However, by default, Spark writes to the warehouse directory (e.g., /user/hive/warehouse). To override this, the engineer must explicitly define the file path.

Which line of code ensures the data is saved to a specific location?

Options:

A.

users.write(path="/some/path").saveAsTable("default_table")

B.

users.write.saveAsTable("default_table").option("path", "/some/path")

C.

users.write.option("path", "/some/path").saveAsTable("default_table")

D.

users.write.saveAsTable("default_table", path="/some/path")

Question 29

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

Options:

A.

A job contains multiple stages, and each stage contains multiple tasks.

B.

A job contains multiple tasks, and each task contains multiple stages.

C.

A stage contains multiple jobs, and each job contains multiple tasks.

D.

A stage contains multiple tasks, and each task contains multiple jobs.

Question 30

What is the benefit of Adaptive Query Execution (AQE)?

Options:

A.

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

B.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

C.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

D.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.
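
For reference, AQE and its main features are controlled by these configurations (enabled by default in recent Spark versions; values shown only to name the settings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # runtime re-optimization of the plan
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # split skewed join partitions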

Question 31

Which feature of Spark Connect should be considered when designing an application that plans to enable remote interaction with a Spark cluster?

Options:

A.

It is primarily used for data ingestion into Spark from external sources.

B.

It provides a way to run Spark applications remotely in any programming language.

C.

It can be used to interact with any remote cluster using the REST API.

D.

It allows for remote execution of Spark jobs.

Question 32

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

Options:

A.

Use an RDD action like reduce() to compute the maximum time

B.

Use an accumulator to record the maximum time on the driver

C.

Broadcast a variable to share the maximum time among workers

D.

Configure the Spark UI to automatically collect maximum times

Question 33

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior? (Choose 2 answers)

Options:

A.

Transformations are executed immediately to build the lineage graph.

B.

The Spark engine optimizes the execution plan during the transformations, causing delays.

C.

Transformations are evaluated lazily.

D.

The Spark engine requires manual intervention to start executing transformations.

E.

Only actions trigger the execution of the transformation pipeline.
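
A short sketch of the lazy-evaluation behavior described above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)

# Transformations only build the logical plan; nothing executes yet.
transformed = df.filter("id % 2 = 0").select("id")

# An action triggers the optimized plan to run.
print(transformed.count())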

Question 34

A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

Options:

A.

df.orderBy(col("age").asc(), col("salary").asc()).show()

B.

df.sort("age", "salary", ascending=[True, True]).show()

C.

df.sort("age", "salary", ascending=[False, True]).show()

D.

df.orderBy("age", "salary", ascending=[True, False]).show()

Question 35

A developer is creating a Spark application that performs multiple DataFrame transformations and actions. The developer wants to maintain optimal performance by properly managing the SparkSession.

How should the developer handle the SparkSession throughout the application?

Options:

A.

Use a single SparkSession instance for the entire application.

B.

Avoid using a SparkSession and rely on SparkContext only.

C.

Create a new SparkSession instance before each transformation.

D.

Stop and restart the SparkSession after each action.

Question 36

A data analyst builds a Spark application to analyze finance data and performs the following operations:

filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

Options:

A.

filter

B.

select

C.

groupBy

D.

coalesce
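
As context, explain() makes the difference visible: an Exchange node in the physical plan indicates a shuffle. A sketch on a toy DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(100).withColumn("key", F.col("id") % 10)

df.filter(F.col("id") > 50).explain()   # narrow: no Exchange
df.select("id").explain()               # narrow: no Exchange
df.coalesce(2).explain()                # narrow: Coalesce, no Exchange
df.groupBy("key").count().explain()     # wide: Exchange (shuffle) for the aggregation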

Question 37

In the code block below, aggDF contains aggregations on a streaming DataFrame:

aggDF.writeStream \
    .format("console") \
    .outputMode("???") \
    .start()

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Options:

A.

AGGREGATE

B.

COMPLETE

C.

REPLACE

D.

APPEND
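
For reference, a runnable aggregation stream using the built-in rate source; the output modes supported by Structured Streaming are "append", "update", and "complete":

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
aggDF = stream.groupBy(F.window("timestamp", "10 seconds")).count()

query = (
    aggDF.writeStream
    .format("console")
    .outputMode("complete")   # rewrites the entire result table on every trigger
    .start()
)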

Question 38

A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

Options:

A.

groupBy

B.

filter

C.

select

D.

coalesce

Question 39

A data engineer writes the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv") # ~10 GB

df2 = spark.read.csv("product_data.csv") # ~8 MB

result = df1.join(df2, df1.product_id == df2.product_id)

Which join strategy will Spark use?

Options:

A.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan

B.

Broadcast join, as df2 is smaller than the default broadcast threshold

C.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently

D.

Shuffle join because no broadcast hints were provided
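
For context, the automatic broadcast threshold is a configuration (10 MB by default), and a hint can force the strategy; a sketch with toy DataFrames standing in for the CSV files:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Tables smaller than this threshold are broadcast automatically (default 10 MB).
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

df1 = spark.range(1_000_000).withColumnRenamed("id", "product_id")
df2 = spark.range(100).withColumnRenamed("id", "product_id")

# An explicit hint forces a broadcast regardless of the threshold.
df1.join(broadcast(df2), "product_id").explain()   # look for BroadcastHashJoin in the plan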

Question 40

Given this code:

.withWatermark("event_time", "10 minutes")

.groupBy(window("event_time", "15 minutes"))

.count()

What happens to data that arrives after the watermark threshold?

Options:

A.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

B.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.
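
For reference, a runnable version of this pattern using the built-in rate source; the rename to event_time is only for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream.format("rate").load()
    .withColumnRenamed("timestamp", "event_time")
)

# The watermark is (max event_time seen) - 10 minutes. State for a 15-minute window
# is kept until the watermark passes the window's end; records arriving later than
# that are considered too late and are dropped from the aggregation.
windowed = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "15 minutes"))
    .count()
)

query = windowed.writeStream.outputMode("update").format("console").start()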