Snowflake Interview Questions Set-05

ARCHITECTURE:

Explain Snowflake’s unique multi-cluster shared data architecture.
Snowflake utilizes a hybrid of shared-nothing and shared-disk architectures with three distinct layers:

  • Database Storage Layer: Data is automatically organized into compressed, columnar micro-partitions (each holding roughly 50MB-500MB of uncompressed data) in cloud storage. This layer is shared by all compute resources.
  • Query Processing Layer (Virtual Warehouses): These are independent compute clusters that execute queries. They are not tied to specific storage and can be scaled up (vertical scaling for power) or out (horizontal scaling/multi-cluster for concurrency) dynamically without contention.
  • Cloud Services Layer: This layer coordinates all activities, including authentication, access control (RBAC), metadata management, query optimization, and transaction management. It uses a globally managed service to ensure consistency across all virtual warehouses.

How does the separation of compute and storage benefit performance and cost management?

This separation allows for independent scaling. You can scale up compute resources for high-performance processing without increasing storage costs, and vice-versa. This also enables different workloads (e.g., ETL, BI, ad-hoc analytics) to run on separate virtual warehouses without competing for resources, optimizing performance and allowing for granular cost control through features like auto-suspend and resume.
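As a minimal sketch, workload isolation and the auto-suspend/resume cost controls mentioned above might be set up like this (the warehouse names and sizes are illustrative):

```sql
-- Hypothetical dedicated warehouses for separate workloads
CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 60        -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE;      -- wake automatically when a query arrives

CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;
```

Because each warehouse bills independently, ETL spikes on etl_wh never slow down or inflate the cost of BI queries on bi_wh.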

Explain the different types of caching in Snowflake and their role in performance optimization.

Snowflake employs several caching mechanisms:

  • Result Cache: Stores the results of every query run in the last 24 hours. If an identical query is submitted, the result is returned from the cache without any compute usage, provided the underlying data hasn’t changed.
  • Local Disk Cache: Data used by running queries is cached in memory and on local SSDs of the virtual warehouse to speed up subsequent queries that use the same data.
  • Remote Disk (Storage Layer): The long-term, persistent copy of the data in cloud storage. It is not strictly a cache, but it is the layer from which any warehouse fetches micro-partitions when its local cache misses, making it the shared fallback for queries across different warehouses.

Advanced Concepts & Scenarios

How do you architect Snowflake for 1,000+ concurrent users without degrading performance?
The key is to leverage multi-cluster warehouses to handle the high concurrency. By isolating different workloads (e.g., reporting, ad-hoc analysis) to dedicated warehouses with appropriate auto-scaling policies, you can ensure predictable performance for all users. Maximizing the use of the result cache for repetitive BI queries also significantly reduces compute usage and improves response times.
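A multi-cluster warehouse for this scenario could be defined as follows (a sketch; the name and cluster counts are examples):

```sql
-- Illustrative multi-cluster warehouse for high-concurrency BI traffic
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 10         -- add clusters as concurrent queries queue
  SCALING_POLICY = 'STANDARD'    -- favor starting clusters over queuing
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;
```

With STANDARD scaling, Snowflake spins up additional clusters when queries start queuing and shuts them down as demand drops, so concurrency scales without manual intervention.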

What is a Clustering Key, and when would you use it?

A clustering key (a defined set of columns) physically co-locates data with similar values within the same micro-partitions, improving pruning efficiency for large tables when queries filter or join on those columns. It is primarily useful for very large tables where query performance is critical and automatic clustering isn’t sufficient.
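For example (table and column names are hypothetical), defining a key and checking its effectiveness looks like this:

```sql
-- Cluster a large fact table on columns commonly used in filters and joins
ALTER TABLE sales_fact CLUSTER BY (sale_date, region);

-- Inspect how well the table is clustered on those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(sale_date, region)');
```

Note that maintaining a clustering key consumes credits via automatic reclustering, so it is worth the cost only on large, frequently filtered tables.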

Explain the difference between Snowflake Time Travel and Fail-safe.

  • Time Travel: Allows access to historical data (modified or deleted) within a configurable retention period (default 1 day, up to 90 days for Enterprise Edition). It’s used for auditing, cloning historical data, and self-service data recovery via SQL commands.
  • Fail-safe: Provides an additional, non-configurable 7-day data recovery window that begins after the Time Travel retention period expires. It is for disaster recovery and requires Snowflake support to recover the data.

How do you implement row-level and column-level security in Snowflake for PII data?

Security is implemented using dynamic data masking for column-level security and row access policies for row-level security. These policies use Role-Based Access Control (RBAC) to apply masking or filtering based on the user’s role and data tags, ensuring security is managed centrally and consistently.
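A minimal sketch of both policy types, assuming illustrative table, column, and role names:

```sql
-- Column-level security: mask email for all roles except ANALYST
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('ANALYST') THEN val
    ELSE '****@****.com'
  END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

-- Row-level security: non-admin roles see only one region's rows
CREATE ROW ACCESS POLICY region_policy AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'ADMIN' OR region = 'EMEA';

ALTER TABLE customers ADD ROW ACCESS POLICY region_policy ON (region);
```

Both policies are schema-level objects, so they can be defined once by a governance team and applied consistently across many tables.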
Describe how you would design a near-real-time Change Data Capture (CDC) pipeline.
A common approach uses Snowflake Streams and Tasks. A stream tracks data changes (inserts, updates, deletes) in a source table, and a task, acting as a built-in scheduler, orchestrates the process of applying these changes to a target table. Snowpipe can be integrated to handle continuous file ingestion from external stages.
How does Snowflake handle data ingestion without the use of a traditional ETL tool?
Snowflake provides several built-in mechanisms:
COPY INTO: For batch loading data from internal or external stages. It can handle large data volumes efficiently.
Snowpipe: A continuous, event-based data ingestion service that automatically loads new data files as they arrive in cloud storage (like AWS S3 or Azure Blob Storage).
Snowpark: Developers can write Python, Java, or Scala code to perform complex transformations within Snowflake, leveraging its compute power for ELT processes.
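The first two mechanisms can be sketched as follows (the stage, pipe, and table names are examples):

```sql
-- Batch load from a named stage
COPY INTO raw_orders
FROM @orders_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';

-- Continuous ingestion: Snowpipe auto-loads new files as they land in cloud storage
CREATE PIPE orders_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_orders
  FROM @orders_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```

The pipe reuses the same COPY INTO statement but is triggered by cloud event notifications instead of a schedule.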
CACHE:
1. What are the different types of caching mechanisms in Snowflake?
Snowflake uses three main types of caching:

Result Cache: Stores the results of every query executed in the last 24 hours. If the same query is run again and underlying data hasn’t changed, the result is returned without using compute resources.
Warehouse Cache (Local Disk/SSD): As data is retrieved from the storage layer, it is cached in the compute layer (virtual warehouse) SSD. It remains active as long as the warehouse is running.
Metadata Cache: The cloud services layer caches table metadata, including table file locations and micro-partition metadata, speeding up queries that do not require raw data scanning.
2. How long does the Result Cache persist?
The Result Cache persists for 24 hours. If a query is repeated within this timeframe and the underlying data in the tables has not changed, the result is reused. Each reuse resets the 24-hour retention period, up to a maximum of 31 days from the original execution.

3. If the underlying table data is updated, what happens to the Result Cache?
If the underlying table data is modified (e.g., via INSERT, UPDATE, DELETE), the entries in the Result Cache for that table are invalidated. The next time the query is run, it will compute the new results.
4. Does the Warehouse Cache persist if the warehouse is suspended?
No, the Warehouse Cache (local SSD cache) is lost when a virtual warehouse is suspended or resized. When the warehouse resumes, the cache is cold, and the next queries will take longer as they fetch data from the remote storage layer.
5. How can you take advantage of the Warehouse Cache?
To maximize the benefits of the Warehouse Cache, it is best to keep a warehouse running if it is serving frequent, repeated queries, or to group similar workloads to run on the same warehouse.

6. What is the role of Metadata Cache?
The Metadata Cache is located in the Cloud Services layer. It stores information such as file locations, table statistics, and micro-partition metadata. This allows queries that only scan metadata (e.g., SELECT COUNT(*) on a table) to run instantly without starting a virtual warehouse.

7. How do you bypass the Result Cache?
You cannot directly turn off the Result Cache. However, if you are benchmarking query performance, you can use the ALTER SESSION SET USE_CACHED_RESULT = FALSE; command to bypass it for your current session, forcing the query to use the warehouse cache or storage layer.

8. Which cache is faster: Result Cache or Warehouse Cache?
The Result Cache is faster, as it skips the execution phase entirely. The Warehouse Cache is still very fast but requires the query to be executed by the warehouse, although it avoids fetching data from remote storage.

9. Do all users share the same Result Cache?
Yes, the Result Cache is shared across users, provided they have the necessary privileges to access the data and are running the exact same query.

10. What is the impact of table clustering on caching?
Properly clustered tables improve the effectiveness of the Warehouse Cache by ensuring that only the relevant micro-partitions are loaded into the cache.
TABLES:
Q: Explain how data is stored in Snowflake?
A: Data is stored in micro-partitions, which are small, columnar, compressed, and encrypted files managed entirely by Snowflake. This architecture allows for efficient pruning, high performance, and automatic management.
Q: What is the difference between Permanent, Transient, and Temporary tables?
A:
Permanent: Default, highest data protection (Time Travel up to 90 days on Enterprise Edition, plus 7-day Fail-safe).
Transient: No fail-safe, lower retention (0-1 day time travel), used for ETL staging.
Temporary: Session-specific, dropped automatically when the session ends.
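The three table types are created with straightforward DDL, for example:

```sql
-- Illustrative DDL for the three table types
CREATE TABLE orders_perm (id INT, amount NUMBER);            -- permanent (default)
CREATE TRANSIENT TABLE orders_stg (id INT, amount NUMBER);   -- no Fail-safe; good for ETL staging
CREATE TEMPORARY TABLE orders_tmp (id INT, amount NUMBER);   -- dropped when the session ends
```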
Q: What is “Zero-Copy Cloning” and when would you use it?
A: It creates a copy of a database, schema, or table instantly without physically copying data, which saves storage costs. It is used for creating development/testing environments from production data.
Q: How do you optimize query performance in Snowflake?
A: 1. Utilize micro-partition pruning. 2. Implement Clustering Keys for large tables. 3. Monitor Query Profile to identify bottlenecks. 4. Choose the right size for the Virtual Warehouse. 5. Use materialized views for pre-computation.
Q: Explain “Time Travel” and “Fail-Safe” in Snowflake?
A: Time Travel allows querying/restoring data from a past point (up to 90 days for Enterprise). Fail-safe provides an additional 7-day period for historical data recovery, accessible only by Snowflake support, following the Time Travel period.
Q: What are the different types of caching in Snowflake?
A: 1. Metadata Cache (table row counts, file locations). 2. Query Result Cache (results of executed queries). 3. Warehouse Cache (data loaded from remote storage to compute nodes).
Q: Describe the 3 layers of Snowflake architecture?
A: 1. Database Storage: Holds data in columnar format (S3/Azure Blob). 2. Query Processing: Virtual Warehouses (compute) perform queries. 3. Cloud Services: Coordinates, manages security, and caches metadata.
Scenario-Based Questions
Q: How do you handle changing data requirements (Schema Evolution) in Snowflake?
A: Use the COPY INTO command with MATCH_BY_COLUMN_NAME to map CSV/JSON data to table columns, and use ALTER TABLE to add new columns for schema evolution.
Q: How do you share data securely with another Snowflake account?
A: Utilize Snowflake Data Sharing (Secure Views/Tables) to share data without copying it, eliminating data movement.
TIMETRAVEL:
What is Snowflake Time Travel?
Time Travel is a powerful feature in Snowflake that allows you to access historical data (that has been modified or deleted) at any point within a defined data retention period. It is used for querying past data, restoring dropped objects (tables, schemas, databases), and cloning data from specific points in time.
What is the default data retention period?
The standard default data retention period for all permanent tables, schemas, and databases is one day (24 hours).
Can the retention period be increased? What is the maximum?
Yes, for accounts on the Enterprise Edition (or higher), the maximum data retention period for permanent objects can be increased up to 90 days. For Standard Edition accounts, the maximum is one day.
Do all table types support Time Travel and Fail-safe?
No.
Permanent tables support both Time Travel (up to 90 days in Enterprise, 1 day in Standard) and an additional 7-day Fail-safe period.
Transient tables are similar to permanent tables but have no Fail-safe period, only the configurable Time Travel retention (0 or 1 day).
Temporary tables exist only for the duration of the user session and have no Fail-safe period; their data retention is limited to the session’s lifespan or a maximum of 1 day.
External tables have 0 days of retention and do not support Time Travel or Fail-safe.
How can you disable Time Travel for a table, and what happens when you do?
You can disable Time Travel by setting the DATA_RETENTION_TIME_IN_DAYS to 0 using the ALTER TABLE command.
Example: ALTER TABLE my_table SET DATA_RETENTION_TIME_IN_DAYS = 0;
When disabled, historical data will be moved directly into the Fail-safe zone once modified, if applicable for the table type.
Accessing Historical Data (Querying)
How do you query historical data using Time Travel?
You can use the AT or BEFORE keywords with specific time references (timestamp, offset, or query ID) in a SQL query.
There are three primary methods:
Using a timestamp:

```sql
SELECT * FROM my_table AT(TIMESTAMP => '2025-01-01 12:00:00'::TIMESTAMP);
```

Using an offset (relative time in seconds), e.g. to query the data as it was 5 minutes ago:

```sql
SELECT * FROM my_table AT(OFFSET => -5 * 60);
```
Using a Query ID, to see the state of a table immediately before a specific DML statement (e.g., a problematic update) was executed:

```sql
SELECT * FROM my_table BEFORE(STATEMENT => '8e0c11d0-0000-0000-0000-000000000000');
```
Where can you find the Query ID to use in a Time Travel query?
Query IDs can be found in the Query History section of the Snowflake UI, which stores information for the last 14 days. You can query the QUERY_HISTORY table function in the Information Schema or Account Usage views to find the specific ID.
Data Recovery Scenarios
You accidentally dropped a production table. How do you recover it?
You can use the UNDROP TABLE command to restore a dropped table as long as it is within the data retention period.
Example: UNDROP TABLE dropped_table_name;
What if you dropped a table, and then a new table was created with the same name? Can you still use UNDROP?
No, the UNDROP command will fail because an object with the same name already exists. To recover the original, dropped table, you would first need to rename the new table, and then run the UNDROP command on the original table name.
Example:

```sql
ALTER TABLE current_table_name RENAME TO new_name_temp;
UNDROP TABLE original_table_name;
```

How would you restore a table to a previous state after a faulty DML operation (e.g., an incorrect update statement)?
You can use a combination of Time Travel and Zero-Copy Cloning to restore the table.
Identify the state of the table before the faulty operation using an AT or BEFORE query.
Create a new, corrected table from that historical data:
```sql
CREATE OR REPLACE TABLE my_table_restored AS
SELECT * FROM my_table AT(TIMESTAMP => 'timestamp_before_error'::TIMESTAMP);
```

If you need to replace the original table, rename the original to a backup name and the restored table to the original name. Note that CREATE TABLE ... AS SELECT physically copies the data; for a faster, storage-efficient recovery you can instead clone at a point in time (CREATE TABLE my_table_restored CLONE my_table AT(TIMESTAMP => ...)), which is a zero-copy operation.
What is the difference between Time Travel and Fail-safe?
Time Travel is a user-configurable feature (0 to 90 days) designed for standard data recovery, auditing, and analysis by end-users via SQL queries.
Fail-safe is a non-configurable, 7-day data recovery feature that kicks in after the Time Travel retention period expires. It is primarily a disaster recovery mechanism managed internally by Snowflake, and data recovery through Fail-safe requires Snowflake Support involvement and is not guaranteed to be instantaneous.
Cost and Storage
How does Time Travel affect storage costs?
Snowflake charges for the storage of historical data retained during both the Time Travel and Fail-safe periods. Data is stored in a compressed, optimized, columnar format (micro-partitions). Tracking storage usage for Time Travel is important for cost management and can be monitored via account usage views.
Does Zero-Copy Cloning within the Time Travel window cost more storage?
No, Zero-Copy Cloning creates a new table, schema, or database without duplicating the underlying physical data at the time of creation. Both the original and the clone share the same micro-partitions. You only incur additional storage costs when changes (DML operations) are made to either the original or the cloned object.
CLONE:
1. What is Zero-Copy Cloning and its advantages?
Answer: It is a metadata-only operation that creates a clone of a table, schema, or database instantly. It does not create a physical copy of the data files (micro-partitions).
Advantages: Instant cloning, zero storage cost for the initial clone, and efficient data sharing between environments.
2. Can you perform DML operations on a cloned table, and how does it affect storage?
Answer: Yes, DML operations (INSERT, UPDATE, DELETE) are fully supported on a cloned table.
Storage Impact: When data is modified in the clone, only the new or changed micro-partitions are stored, which incur storage costs. The unchanged data continues to point to the original table’s micro-partitions.
3. Does cloning a database also clone the stages or external tables?
Answer: Only partially. When a database or schema is cloned, external (named) stage definitions are cloned, but internal named stages and external tables are not. Cloning also never copies files sitting in external cloud storage; it covers only Snowflake-managed table data and metadata.
4. What is the difference between CLONE and COPY?
Answer: CLONE is instantaneous and creates a pointer-based copy (zero-copy), requiring no immediate extra storage. COPY (e.g., CREATE TABLE AS SELECT) physically duplicates all data, consuming time and storage.
5. What are the key limitations of cloning?
Answer:
It does not clone internal stages.
It does not clone pipes that reference an internal stage (pipes referencing external stages are cloned, though it can always clone the table the pipe feeds).
If the source object is dropped, the clone continues to exist independently.
6. Can you clone a table at a specific point in time?
Answer: Yes, using Time Travel. For example:
```sql
CREATE TABLE table_clone_restore
CLONE table_original
AT (TIMESTAMP => '2023-01-01 12:00:00'::TIMESTAMP_TZ);
```

This is highly useful for recovering data from a specific point in time.
7. What happens if I clone a table and then the original table is dropped?
Answer: The cloned table becomes independent of the original table. It will continue to function, and the data it pointed to will not be removed by the dropping of the original table.
8. How to clone a database using SQL?
Answer: CREATE DATABASE new_db_clone CLONE existing_db;
9. What is the difference between cloning and creating a Time Travel snapshot?
Answer: Cloning creates a new, persistent object that exists until dropped, regardless of the retention period of the source. Time Travel (AT or BEFORE clause) is a temporary, query-based snapshot that only works within the data retention period (0-90 days).
10. How does zero-copy cloning relate to Time Travel?
Answer: Cloning allows you to clone a table as it existed at any point within its Time Travel retention period, allowing for rapid recovery of dropped or damaged tables.
STAGES AND LOADING:
Q1: Differentiate between internal and external stages. When would you choose one over the other?
Internal Stages: These are secure storage locations managed entirely within Snowflake’s cloud storage, including user, table, and named stages. They are ideal when:
Security is a top priority, as data is encrypted at rest by default.
You need a simple, self-contained solution without managing external cloud storage accounts and access policies.
You are loading data from a local file system using the PUT command.
External Stages: These reference data files stored in external cloud storage like AWS S3, Azure Blob Storage, or Google Cloud Storage. They are preferred when:
Data is already residing in the cloud storage as part of an existing ETL/ELT pipeline.
Large volumes of data need to be loaded and shared across different platforms or accounts.
You want to query the data directly from the stage using external tables without loading it into Snowflake tables first.
Q2: Explain the different data loading methods in Snowflake. For a near real-time ingestion requirement, which method is most suitable?
Snowflake offers four main data loading methods:
COPY INTO <table>: A SQL command used for bulk loading of data from staged files into a table. This is best for scheduled batch loading.
Snowpipe (Auto-Ingest or REST API): A continuous data ingestion service that loads data within minutes of files being added to a stage. It’s event-driven and serverless.
Snowpipe Streaming (via Kafka or Client SDK): Used for row-by-row data ingestion with lower latency than standard Snowpipe.
COPY with Serverless Tasks: Allows using serverless compute to run COPY INTO commands on a schedule.
For a near real-time ingestion requirement, Snowpipe (Auto-Ingest) is the most suitable method. It eliminates the need for manual batch loads and uses event notifications (e.g., S3 event notifications) to trigger the loading process automatically, providing fast data access.

Q3: How do you handle “bad” or error records during a large data load using COPY INTO?
During a COPY INTO operation, you can use error handling options in the command syntax to manage bad records:

ON_ERROR = CONTINUE: Skips the erroneous file or row and continues loading the rest of the valid data. The errors are recorded and can be viewed later.
ON_ERROR = SKIP_FILE: Skips the entire file if a single error is detected.
ON_ERROR = ABORT_STATEMENT: Aborts the entire loading operation upon encountering the first error (default behavior).
For experienced users, the best practice is to use ON_ERROR = CONTINUE and leverage the VALIDATION_MODE parameter or the COPY_HISTORY and LOAD_HISTORY views in the Information Schema to analyze and correct the bad records after the load is complete.
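The recommended workflow above might look like this in practice (stage and table names are examples):

```sql
-- Pre-flight check: report all errors without loading any data
COPY INTO target_table
FROM @my_stage
VALIDATION_MODE = 'RETURN_ALL_ERRORS';

-- Actual load: skip bad rows instead of aborting the whole statement
COPY INTO target_table
FROM @my_stage
ON_ERROR = 'CONTINUE';

-- Review what was loaded or rejected in the last hour
SELECT * FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'TARGET_TABLE',
  START_TIME => DATEADD('hour', -1, CURRENT_TIMESTAMP())));
```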

Q4: What are the best practices for optimizing data loading performance in Snowflake?
Optimization for experienced professionals focuses on architecture and file management:

File Sizing: Ensure data files in the stage are optimally sized, typically between 100 MB and 250 MB (compressed), to balance ingestion performance and metadata overhead. Avoid loading a large number of very small files, which can degrade performance and increase costs.
Batching: Orchestrate file arrival in the staging area into micro-batches for efficient processing, instead of triggering an event for every single tiny file.
Data Compression: Always compress your data files (e.g., GZIP, Snappy) before staging to reduce storage, data transfer costs, and load times.
Warehouse Sizing: Use an appropriately sized virtual warehouse for the COPY INTO command. Loading large volumes of data may benefit from a larger warehouse size for faster completion.
Partitioning: Organize external stage files using a logical folder structure (e.g., based on date or region) to enable efficient file pruning during the COPY INTO process.
Stages are holding areas for data files before they are loaded into Snowflake tables.

External Stages: Reference files stored in cloud providers like AWS S3, Azure Blob, or Google Cloud Storage.
Internal Stages: Managed entirely by Snowflake.
User Stages: Specific to each user; cannot be shared.
Table Stages: Created automatically for every table.
Named Internal Stages: Created manually; offer the most flexibility for multiple users/tables.

Experienced-Level Interview Questions & Answers
1. Data Loading Strategy
Q: When would you use Snowpipe versus bulk loading with the COPY command?
A: Use Snowpipe for continuous, event-driven ingestion where data arrives in small, frequent files and needs near real-time availability. Use COPY INTO for large batches (GBs of data) that are scheduled at specific intervals (e.g., nightly ETL) to maximize warehouse efficiency.
2. Performance Optimization
Q: How do you handle file sizing and parallel processing during a load?
A: For optimal performance, aim for file sizes between 100MB and 250MB (compressed). Snowflake can then distribute these files across the nodes of a Virtual Warehouse for parallel processing. Loading a single massive 1TB file is far slower than loading 4,000 files of 250MB each.
3. Error Handling
Q: How do you handle partially corrupted files during a bulk load without failing the entire job?
A: Use the ON_ERROR parameter in the COPY INTO command.
CONTINUE: Skip the error and continue loading the rest.
SKIP_FILE: Skip the entire file if any error is found.
ABORT_STATEMENT (Default): Fails the entire load.
4. Semi-Structured Data
Q: What is the best practice for loading JSON data into Snowflake?
A: Load the raw JSON into a column with the VARIANT data type. This allows you to store the data without a pre-defined schema. You can later flatten or parse it using SQL dot notation for analytics.
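A sketch of this pattern, assuming a hypothetical events table and JSON shape:

```sql
-- Land raw JSON in a VARIANT column with no pre-defined schema
CREATE TABLE raw_events (payload VARIANT);

-- Parse fields later with dot notation and casts
SELECT payload:user.id::STRING     AS user_id,
       payload:event_type::STRING  AS event_type
FROM raw_events;

-- Flatten a nested array (one output row per array element)
SELECT e.payload:order_id::STRING AS order_id,
       f.value:sku::STRING        AS sku
FROM raw_events e,
     LATERAL FLATTEN(INPUT => e.payload:items) f;
```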
5. Advanced Stage Management
Q: How can you access data in an S3 bucket without using an AWS Secret Key in your SQL code?
A: Use a Storage Integration object. This is a more secure method that uses IAM roles to establish trust between Snowflake and AWS, removing the need for hardcoded credentials in CREATE STAGE commands.
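A hedged sketch of that setup (the bucket, role ARN, and object names are placeholders you would replace with your own):

```sql
-- Trust an AWS IAM role instead of embedding keys in SQL
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/data/');

-- The stage references the integration, so no credentials appear here
CREATE STAGE ext_stage
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = s3_int;
```

After creating the integration, you run DESC STORAGE INTEGRATION to retrieve the Snowflake-side IAM user and external ID, which you then add to the AWS role's trust policy.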
COPY INTO (Batch): Best for large, scheduled loads (e.g., nightly ETL). It allows for complex transformations during ingestion, such as column reordering and casts.
Snowpipe (Continuous): Ideal for near-real-time ingestion from stages. It is serverless, reacting to cloud event notifications (S3, GCS, Azure) within minutes.
Snowpipe Streaming (Next-Gen): Bypasses cloud object storage to ingest rowsets directly from sources like Kafka, offering latencies as low as 5 seconds.

Performance Tuning & Scenario-Based Troubleshooting
Performance in Snowflake loading is driven by parallelism and file optimization.
The 100-250MB Sweet Spot
Snowflake parallelizes work based on the number of files.
Small Files (< 10MB): Create massive metadata overhead.
Massive Files (> 100GB): Cannot be easily parallelized across warehouse nodes.
Optimal Strategy: Aim for 100MB to 250MB (compressed) per file.
Scenario Questions
Q: You see 90% CPU I/O on a warehouse running a load; how do you fix it?
A: Assign a dedicated warehouse for that specific load operation to isolate it from other analytical workloads.
Q: A Storage Integration fails even though the IAM role is correct. Why?
A: If CREATE OR REPLACE STORAGE INTEGRATION was used, the hidden internal ID changed. Stages linked to the old ID will fail until updated with the new integration object.
Q: Does Snowpipe guarantee order of loading?
A: No. Files are loaded concurrently and asynchronously; if ordering is required, use event timestamps within the data itself.
Error Handling Strategies
Using VALIDATION_MODE is preferred for pre-flight checks before actual ingestion.
VALIDATION_MODE = 'RETURN_ALL_ERRORS': Scans the staged files and reports all errors without loading any data.
ON_ERROR = CONTINUE: Skips the bad rows and loads the rest, preventing a single row from failing a multi-GB load.
Actionable Tips for Your Interview
Know the syntax: Be prepared to write a basic COPY INTO statement or a CREATE PIPE command on a whiteboard.
Mention Cost: Mention that Snowpipe uses serverless compute, whereas COPY uses a Virtual Warehouse you manage.
Pruning: Explain how Micro-partitions and metadata help skip irrelevant data during the loading and querying process.
MASKING POLICY:
Q: What is dynamic data masking in Snowflake?
A: Dynamic data masking is a column-level security feature that hides sensitive data in real-time at query execution, based on the user’s role or other conditions. The underlying data in the table is not changed (unlike static masking).
Q: How do you implement a simple dynamic masking policy?
A: You define a SQL-based masking policy as a schema-level object, which includes conditional logic (e.g., using CURRENT_ROLE() or a custom entitlement table). Then, you apply the policy to a specific column in a table or view using the ALTER TABLE/VIEW … MODIFY COLUMN statement.
Q: Can you provide an example of a masking policy for an email address?
A: A policy might use a CASE statement: if the user’s role is ANALYST, show the full email; otherwise, show a masked value like ‘****@****.com’.
Q: What is the difference between column-level security and row-level security?
A: Column-level security (masking policies) controls access to specific columns in a table, while row-level security (row access policies) controls which rows are visible to a user.
Q: How does a masking policy interact with other data access policies, such as a row access policy?
A: A column cannot be used as an argument in a row access policy if it is already masked by a masking policy. This means you must carefully design your data governance strategy to avoid conflicts and ensure proper data protection.
Q: How can you use tag-based masking policies to manage security at scale?
A: Instead of applying policies to individual columns, you can create a tag, assign the masking policy to that tag, and then apply the tag to multiple columns across different tables. This simplifies management and ensures consistency.
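A sketch of tag-based masking with hypothetical tag, policy, and table names:

```sql
-- One tag plus one policy governs many columns across many tables
CREATE TAG pii_tag;

CREATE MASKING POLICY pii_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'ANALYST' THEN val ELSE '***MASKED***' END;

-- Bind the policy to the tag
ALTER TAG pii_tag SET MASKING POLICY pii_mask;

-- Any column carrying the tag is now automatically masked
ALTER TABLE customers MODIFY COLUMN email SET TAG pii_tag = 'email';
ALTER TABLE employees MODIFY COLUMN phone SET TAG pii_tag = 'phone';
```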
Q: A user with the PUBLIC role is querying a masked column and seeing the masked data, but an authorized user with a specific ANALYST role should see the unmasked data. How do you verify the implementation?
A: You would switch between roles using the USE ROLE command and execute the query to confirm the expected outcome. The ANALYST role should be granted the necessary privileges within the masking policy definition.
Q: What happens if you try to apply two different masking policies to the same column?
A: Snowflake only allows a single masking policy to be applied to a column at any given time. Attempting to apply a second one will result in an error.
Q: How does Snowflake’s architecture support dynamic data masking?
A: Dynamic data masking is enforced in the cloud services layer at query compilation/runtime, without modifying the underlying data in the storage layer. The metadata manager tracks which policies apply to which columns, allowing the query optimizer to return masked or unmasked results based on the user’s context efficiently.
Q: How do you monitor the usage and effectiveness of your masking policies?
A: You can query the Information Schema views, such as MASKING_POLICIES and POLICY_REFERENCES, to track which policies are active, which columns they affect, and audit access patterns to ensure security measures are effective.

STREAMS:
Snowflake Streams interviews focus on Change Data Capture (CDC) workflows, performance optimization, and integration with tasks, emphasizing “one stream per consumer” rules. Key topics include Metadata columns, SYSTEM$STREAM_HAS_DATA, and handling high-volume CDC scenarios efficiently.

Advanced Snowflake Streams Interview Questions
1. What are the key metadata columns present in a Snowflake Stream?
Answer: A stream includes the original table columns plus three metadata columns:
METADATA$ACTION: Indicates DML action (INSERT, DELETE).
METADATA$ISUPDATE: A boolean indicating if the row was part of an UPDATE statement.
METADATA$ROW_ID: A unique, immutable ID for the row.
2. Explain the best practice for using multiple consumers on a single table.
Answer: The best practice is “one stream, one consumer”. Multiple streams can be created on the same table, but if multiple downstream consumers need the same data, it is recommended to create separate streams for each to avoid coordination issues and ensure that each consumer processes all data, as consuming from a stream advances its offset.


3. How do you check if a stream has data before running a task?
Answer: I use the built-in function SYSTEM$STREAM_HAS_DATA('stream_name'). This returns a boolean value (true or false), allowing me to skip task execution if the stream is empty, saving costs.
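This check is typically placed in a task's WHEN clause, so the warehouse never even starts if there is nothing to process (the warehouse, task, stream, and table names below are examples):

```sql
-- Task runs every 5 minutes but only does work when the stream has rows
CREATE TASK process_changes
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  INSERT INTO orders_audit SELECT * FROM orders_stream;

-- Tasks are created suspended; resume to start the schedule
ALTER TASK process_changes RESUME;
```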
4. How do you handle CDC when a table is truncated?
Answer: A TRUNCATE statement on the source table is registered as a DELETE in the stream. However, depending on the volume, it may appear as a large volume of deleted rows. I would ensure the downstream processing can handle high-volume deletions.
5. Can a stream be created on a View or a Materialized View?
Answer: Streams can be created on standard (non-materialized) views, provided the underlying tables support change tracking and the view meets Snowflake's restrictions. Materialized views are not supported as stream sources.
6. What happens if a stream is not consumed for a long time?
Answer: If the stream's offset falls outside the retention window of the source table (Snowflake can extend retention for this purpose up to MAX_DATA_EXTENSION_TIME_IN_DAYS, which defaults to 14 days), the stream becomes stale, and an error is thrown when querying it.
7. How do you handle schema changes (e.g., adding a column) on a table with an active stream?
Answer: Snowflake streams are resilient to schema changes. If a column is added to the source table, the new column will automatically appear in the stream in subsequent transactions.
8. How do you implement “upsert” logic using streams and tasks?
Answer:
Create a stream on the source table.
Create a task that runs on a schedule.
Within the task, use a MERGE statement that consumes from the stream, updating existing records (METADATA$ISUPDATE = TRUE) and inserting new ones (METADATA$ACTION = 'INSERT').
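The steps above can be sketched as follows; the warehouse, table, stream, and column names are hypothetical:

```sql
CREATE OR REPLACE TASK upsert_customers
  WAREHOUSE = etl_wh
  SCHEDULE  = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('customer_stream')
AS
MERGE INTO dim_customer t
USING (
    -- 'INSERT' rows include brand-new rows and the new image of updated rows
    SELECT * FROM customer_stream WHERE METADATA$ACTION = 'INSERT'
) s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.METADATA$ISUPDATE THEN
  UPDATE SET t.name = s.name, t.email = s.email
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email) VALUES (s.customer_id, s.name, s.email);
```

Selecting from the stream inside the MERGE advances its offset once the transaction commits.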
9. Can we clone a stream?
Answer: Yes. A stream can be cloned with CREATE STREAM … CLONE, and streams are also cloned implicitly when their containing database or schema is cloned. The clone starts at the same offset as the original stream.
10. What is the difference between a table-stream and a view-stream?
Answer: A table stream tracks DML changes (INSERT, UPDATE, DELETE) on the table directly. A view stream tracks changes to the view’s underlying base tables and surfaces them through the view’s definition; it carries the same metadata columns but is constrained by restrictions on the view (for example, no aggregations).
Key Concepts to Review
CDC (Change Data Capture): Streams are the foundational component of CDC in Snowflake.
Data Retention: Understanding how stream staleness interacts with Time Travel.
Monitoring: Using SHOW STREAMS and DESCRIBE STREAM to check the status and staleness of streams.


1. Advanced Streams & CDC (Change Data Capture)
Can you create multiple streams on a single table?
Answer: Yes, you can create multiple streams on a single table. This is useful when different downstream consumers (e.g., separate ETL tasks) need to consume the same change data independently.
What are the metadata fields in a Snowflake Stream?
Answer: A stream adds three columns to the source data: METADATA$ACTION (INSERT/DELETE), METADATA$ISUPDATE (TRUE if part of an update), and METADATA$ROW_ID (unique row identifier).
How do you handle a scenario where a stream is empty?
Answer: Use the SYSTEM$STREAM_HAS_DATA function. This is critical for Task scheduling to prevent unnecessary warehouse costs when no new data is present.
2. Performance & Optimization at Scale
How does micro-partition pruning differ from traditional indexing?
Answer: Snowflake doesn’t use traditional indexes. Instead, it uses micro-partitioning and stores metadata (min/max values) for each column per partition. The query optimizer uses this metadata to “prune” (skip) irrelevant partitions, significantly speeding up queries.
When should you explicitly define a Clustering Key?
Answer: While Snowflake clusters data automatically, a manual clustering key is recommended for tables typically larger than several TBs where query performance has degraded despite pruning.
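A minimal sketch of defining and checking a clustering key (table and column names are hypothetical):

```sql
-- Define (or change) the clustering key on a large fact table
ALTER TABLE sales_fact CLUSTER BY (sale_date, region);

-- Inspect clustering quality for those columns (lower depth is better)
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(sale_date, region)');
```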
How would you investigate a long-running query?
Answer: Use the Query Profile in the Snowflake UI to identify bottlenecks like “exploding joins,” high spilling to remote storage, or inefficient pruning.
3. Scenario-Based & Architectural Questions
You need to migrate 10TB of data with minimal downtime. What is your approach?
Answer: Perform a bulk load using the COPY INTO command from an External Stage (e.g., S3). Use a larger Virtual Warehouse (X-Large or higher) for the initial load and ensure files are split into optimal sizes (100MB–250MB compressed).
What is the difference between a Temporary and Transient table?
Answer:
Temporary: Persists only for the duration of the session; no Fail-safe period.
Transient: Persists until explicitly dropped but does not have a Fail-safe period, saving on storage costs for non-critical data.
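For example (hypothetical table names):

```sql
-- Dropped automatically when the session ends; no Fail-safe
CREATE TEMPORARY TABLE session_scratch (id INT, payload VARCHAR);

-- Survives the session, but skips Fail-safe to save storage costs
CREATE TRANSIENT TABLE staging_orders (id INT, payload VARCHAR);
```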
Explain “Zero-Copy Cloning” and its cost implications.
Answer: Cloning creates a new metadata-only pointer to existing micro-partitions. It costs nothing initially; you only pay for storage when data in either the original or the clone is modified (new micro-partitions are created).
4. Security & Governance
How do you implement Row-Level Security (RLS)?
Answer: Use Row Access Policies. These are schema-level objects that use a mapping table or conditional logic to determine which rows a user can see based on their role.
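A sketch of a Row Access Policy backed by a mapping table (all object names are hypothetical):

```sql
CREATE OR REPLACE ROW ACCESS POLICY region_policy
AS (region VARCHAR) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'ADMIN'
  OR EXISTS (
      SELECT 1
      FROM   region_mapping m             -- maps roles to the regions they may see
      WHERE  m.role_name = CURRENT_ROLE()
      AND    m.region    = region
  );

ALTER TABLE sales ADD ROW ACCESS POLICY region_policy ON (region);
```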
What is a “Secure View” and when is it used?
Answer: Secure views prevent users from seeing the underlying query definition or data that was filtered out, which is vital for multi-tenant environments where data privacy is paramount.

TASKS:
1. What are Snowflake Tasks and their main components?
Answer: A Task is a Snowflake object that allows for the scheduling of a single SQL statement, including CALL to a stored procedure or a COPY INTO command.


Key Components:
Schedule: Defines frequency (e.g., USING CRON).
Action: The SQL to execute.
Warehouse: The compute resource to run the task (can be a user-managed virtual warehouse or serverless).
Condition: Optional (WHEN) to check if the task should run based on stream data.
2. Explain the Difference Between Standalone Tasks and Directed Acyclic Graphs (DAGs).
Answer:
Standalone Task: A single task running on a schedule (e.g., CREATE TASK my_task SCHEDULE = '60 MINUTE' …).
DAG (Task Tree): A series of dependent tasks where one task acts as the root and others follow (AFTER).
Key Aspect: A child task begins immediately upon successful completion of its predecessor, rather than waiting for a schedule. This is crucial for managing complex, sequential ETL pipelines.
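A minimal two-node DAG might look like this (warehouse, stage, and table names are hypothetical):

```sql
CREATE TASK load_raw
  WAREHOUSE = etl_wh
  SCHEDULE  = 'USING CRON 0 2 * * * UTC'
AS
  COPY INTO raw_events FROM @landing_stage;

CREATE TASK transform_events
  WAREHOUSE = etl_wh
  AFTER load_raw                      -- runs as soon as LOAD_RAW succeeds
AS
  INSERT INTO curated_events SELECT * FROM raw_events;

-- Tasks are created SUSPENDED; resume children before the root
ALTER TASK transform_events RESUME;
ALTER TASK load_raw RESUME;
```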
3. How can you ensure a task only runs when new data arrives? (Scenario-based)
Answer: I would use a Snowflake Stream in combination with the task WHEN condition.

Workflow:
Create a Stream on the source table:

```sql
CREATE STREAM my_stream ON TABLE raw_table;
```

Create the task with a condition:

```sql
CREATE TASK my_task
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('my_stream')
AS
  CALL my_stored_procedure();
```

Benefit: This avoids unnecessary costs by not running the warehouse if the stream is empty.
4. What are the limitations of Tasks, and how do you overcome them?
Answer:
Limitation 1: A single task can only execute one SQL statement.
Solution: Use a stored procedure (CALL) to encapsulate multiple SQL statements or complex logic, then call that procedure in the task.
Limitation 2: Tasks can fail silently if not monitored.
Solution: Use TASK_HISTORY table function to monitor, and implement error handling within stored procedures.
Limitation 3: Maximum 1000 tasks in a DAG.
Solution: For very complex pipelines, break them into smaller, independent DAGs.
5. What are the differences between Serverless Tasks and User-Managed Tasks?
Answer:
User-Managed (Virtual Warehouse): You specify a WAREHOUSE in the CREATE TASK statement. You have direct control over the size (X-Small to 6X-Large), but you pay for the warehouse uptime (plus auto-suspend time).
Serverless: You omit the WAREHOUSE parameter and use USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE. Snowflake automatically selects compute resources and scales them.
Use Case: Serverless is best for unpredictable, lightweight, or intermittently running tasks, as you pay only for the exact seconds used (rounded up to 1 second).
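A serverless task sketch (object names are hypothetical); note the absence of a WAREHOUSE clause:

```sql
CREATE TASK purge_audit
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  SCHEDULE = '60 MINUTE'
AS
  DELETE FROM audit_log
  WHERE  logged_at < DATEADD(day, -30, CURRENT_TIMESTAMP());
```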
6. How do you monitor task failures, and what is your approach to restarting a failed task tree?
Answer:
Monitoring: I query TABLE(INFORMATION_SCHEMA.TASK_HISTORY()) or ACCOUNT_USAGE.TASK_HISTORY and filter for STATE = 'FAILED'. I also attach an error notification integration to the task (the ERROR_INTEGRATION parameter) so failures are pushed in real time to a cloud messaging service such as Amazon SNS, Azure Event Grid, or GCP Pub/Sub.
Restarting: If a node in a DAG fails, the subsequent tasks do not run. Resuming a task (ALTER TASK … RESUME) only re-enables its schedule; after fixing the issue, I either wait for the next scheduled run of the root task or trigger the DAG immediately with EXECUTE TASK <root_task>.
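A typical monitoring query along these lines:

```sql
-- Failed task runs in the last 24 hours
SELECT name, state, error_message, scheduled_time, completed_time
FROM   TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
         SCHEDULED_TIME_RANGE_START => DATEADD(hour, -24, CURRENT_TIMESTAMP())))
WHERE  state = 'FAILED'
ORDER  BY scheduled_time DESC;
```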
7. How do you limit how long a task is allowed to run?
Answer: Set the USER_TASK_TIMEOUT_MS parameter on the task (the default is 3,600,000 ms, i.e. one hour). If a run exceeds the timeout, Snowflake cancels it, preventing a runaway task from consuming compute indefinitely.
8. Explain the difference between SUSPEND and DROP in task management.
Answer:
SUSPEND: Temporarily stops a task from running on its schedule, but the task object (and its dependencies in a DAG) remains.
DROP: Permanently deletes the task object from the database.
9. A task is running too long and exceeding the warehouse usage limit. How would you optimize this?
Answer:
Analyze Query Profile: Check the Query Profile to identify bottlenecks (e.g., spilling to disk, inefficient joins).
Optimize Source/Target: Ensure the target table is correctly clustered and that I am only processing new data (using streams).
Use WHEN clause: Ensure the task only runs if there is actually data in the stream, reducing wasted compute.
Increase Warehouse Size: For a temporary, high-volume data load, I might increase the warehouse size to shorten the run time (a higher per-second compute rate, but over a shorter duration, so total cost is often similar).
Re-cluster: Ensure the table has proper cluster keys to reduce scanning.
10. How can you run a task immediately without waiting for the scheduled time?
Answer: I can use the command EXECUTE TASK <task_name>;. This is useful for testing or for manual ad-hoc execution.

What are Snowflake Tasks and how do they differ from CRON jobs?
Tasks are first-class Snowflake objects used to schedule the execution of a single SQL statement or a Stored Procedure. Unlike traditional CRON jobs, they are integrated into the Cloud Services Layer, allowing them to trigger based on schedules or dependencies (Directed Acyclic Graphs).
Explain the “Serverless” vs. “User-Managed” Task model.
User-Managed: You specify a Virtual Warehouse to run the task. You pay for the warehouse uptime.
Serverless: Snowflake manages the compute resources automatically. This is ideal for lightweight tasks where maintaining a warehouse is not cost-effective.
Scenario-Based & Advanced Questions
How do you handle Task dependencies (DAGs)?
You use the AFTER keyword to chain tasks. For example, CREATE TASK child_task AFTER parent_task. A DAG (task tree) can contain up to 1,000 tasks in total, including the root.
What is the “Task History” and how do you use it for troubleshooting?
You query the TASK_HISTORY table function in the Information Schema to monitor execution status (SUCCEEDED, FAILED, or SKIPPED) and identify long-running tasks.
How do you prevent multiple instances of the same task from overlapping?
Snowflake prevents this by default; a new instance of a task will not start if a previous instance is still running. However, you can manage this via the ALLOW_OVERLAPPING_EXECUTION parameter.
How do you implement Change Data Capture (CDC) using Tasks and Streams?
A Stream tracks DML changes (inserts, updates, deletes) on a table. A Task can then be scheduled with a WHEN SYSTEM$STREAM_HAS_DATA('stream_name') condition to process those changes only when new data is available.
Error Alerting: Always configure Error Notification Integration using Amazon SNS, Azure Event Grid, or GCP Pub/Sub to get real-time alerts on task failures.
State Management: Remember that newly created tasks are SUSPENDED by default. You must run ALTER TASK <name> RESUME to start them.
Cost Optimization: Use Serverless Tasks for small, frequent executions to avoid the 60-second minimum charge of a dedicated virtual warehouse.
FUNCTIONS
UDF vs. Stored Procedure (SP): This is the #1 question.
The Answer: A UDF is for computation and returning a value (Scalar or Table). It can be used directly in a SELECT statement. An SP is for administrative tasks (DML, DDL). It can return a value but cannot be used in a SELECT clause; it must be called via CALL.
Scalar UDF vs. Tabular UDF (UDTF):
Scalar: Returns one value per row (e.g., converting a currency).
UDTF: Returns a set of rows for each input row.
Pro Tip: Mention that UDTFs are essential when using the LATERAL FLATTEN logic to parse complex JSON.
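A quick sketch of both flavors (function and table names are hypothetical):

```sql
-- Scalar UDF: one value per row
CREATE OR REPLACE FUNCTION to_usd(amount NUMBER, rate NUMBER)
RETURNS NUMBER
AS $$ amount * rate $$;

-- Tabular UDF (UDTF): a set of rows per input row
CREATE OR REPLACE FUNCTION split_tags(tags VARCHAR)
RETURNS TABLE (tag VARCHAR)
AS $$
  SELECT value::VARCHAR AS tag
  FROM   TABLE(FLATTEN(INPUT => SPLIT(tags, ',')))
$$;

-- UDTFs are invoked through TABLE(...) in the FROM clause
SELECT o.id, t.tag
FROM   orders o, TABLE(split_tags(o.tags)) t;
```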

2. External Functions & Security
This demonstrates you’ve worked in enterprise environments where Snowflake needs to talk to the “outside world.”
What are External Functions and how do they work?
Explain the flow: Snowflake → API Gateway (AWS/Azure/GCP) → Remote Service (Lambda/Cloud Function) → Result back to Snowflake.
Follow-up: “How do you secure an external function?” Mention IAM Roles and API Keys within the API Integration object.
Secure UDFs: Why use the SECURE keyword?
Scenario: You have a UDF used in a Data Share. Without SECURE, an inquisitive consumer could potentially use “filter-based attacks” to infer data they shouldn’t see by looking at query profile statistics. Secure UDFs prevent this by ensuring the optimizer doesn’t reorder operations in a way that leaks data.

3. Handling Semi-Structured Data
How do you flatten a nested JSON array in a single query?
Answer: Use the FLATTEN table function with LATERAL. Explain the difference between the THIS, VALUE, and KEY columns in the output of a flatten.
System Functions for JSON: Be ready to discuss CHECK_JSON, PARSE_JSON, and GET_PATH.
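A flattening sketch (the events table and payload path are hypothetical):

```sql
SELECT f.key,    -- field key for objects (NULL for array elements)
       f.value,  -- the current element, as a VARIANT
       f.this    -- the element currently being flattened
FROM   events e,
       LATERAL FLATTEN(INPUT => e.payload:items) f;
```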

4. Performance & Optimization (The “Senior” Tier)
This is where you distinguish yourself from a junior developer.
UDF Performance Issues: If a JavaScript UDF is running slowly, how do you optimize it?
The “Secret”: Mention that JavaScript UDFs can be slower due to the overhead of the V8 engine context switching. If possible, rewrite them in SQL UDFs (which are inlined by the optimizer) or Java/Python UDFs (which can handle larger workloads better).
Determinism: What is a VOLATILE vs. IMMUTABLE function?
Immutable: Returns the same result for the same input (Snowflake can cache this).
Volatile: Result changes even with same input (e.g., CURRENT_TIMESTAMP). Using a volatile function in a WHERE clause can sometimes prevent partition pruning.


5. Coding & Languages
Which languages can you use for UDFs?
SQL, JavaScript, Java, Python, and Scala.
Scenario: “When would you choose a Python UDF over a SQL UDF?”
Answer: Python is superior for complex logic, string manipulation that goes beyond Regex, or when you need to use third-party libraries (like numpy or pandas) via Snowpark
STORED PROCEDURE:
Q: When would you use a stored procedure versus a User-Defined Function (UDF) in Snowflake?
A: Stored procedures are used for complex procedural logic, data validation, and DML (Data Manipulation Language) operations that make permanent changes (e.g., INSERT, UPDATE, DELETE). A procedure returns a single value (often a status message) and is invoked with CALL. UDFs, conversely, compute a scalar value or a set of rows, cannot perform DML, and can be used inline within SQL statements like any other function.
Q: How do you handle errors and exceptions within a Snowflake stored procedure?
A: Stored procedures written in JavaScript use try…catch blocks for robust error handling: throw raises an exception, and the error object caught in the catch block exposes the error code and message. In Snowflake Scripting (SQL) procedures, the equivalent is an EXCEPTION block with SQLCODE and SQLERRM. Mentioning that you log these errors to a dedicated audit table demonstrates experience.
Q: Explain how to manage transactions within a stored procedure.
A: Transactions are crucial for data integrity. You should explain using BEGIN TRANSACTION, COMMIT, and ROLLBACK statements to ensure that a series of SQL operations succeed or fail as a single unit of work. For example, if one UPDATE fails, the previous ones are rolled back in the catch block.
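A Snowflake Scripting sketch of this pattern (object names are hypothetical):

```sql
CREATE OR REPLACE PROCEDURE archive_closed_orders()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
  BEGIN TRANSACTION;
  INSERT INTO orders_archive SELECT * FROM orders WHERE status = 'CLOSED';
  DELETE FROM orders WHERE status = 'CLOSED';
  COMMIT;
  RETURN 'OK';
EXCEPTION
  WHEN OTHER THEN
    ROLLBACK;               -- both statements succeed or fail as one unit
    RETURN 'FAILED: ' || SQLERRM;
END;
$$;
```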
Q: How do you use dynamic SQL within a stored procedure, and what are the security considerations?
A: Dynamic SQL allows you to build SQL statements at runtime, which is useful for tasks like dynamic table names or schema manipulation. In Snowflake, this is typically done using the snowflake.createStatement() method in Javascript and executing it. The key security consideration is SQL injection. The answer should emphasize using bind variables or the IDENTIFIER() function to safely pass object names or values, preventing malicious code execution.
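A JavaScript sketch combining IDENTIFIER() with bind variables (procedure and table names are hypothetical):

```sql
CREATE OR REPLACE PROCEDURE delete_old_rows(TABLE_NAME VARCHAR, CUTOFF_DATE DATE)
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS
$$
  // IDENTIFIER(?) resolves the table name safely; the date is a bind variable,
  // so neither value is concatenated into the SQL text (no injection surface)
  var stmt = snowflake.createStatement({
    sqlText: "DELETE FROM IDENTIFIER(?) WHERE created_at < ?",
    binds: [TABLE_NAME, CUTOFF_DATE]
  });
  stmt.execute();
  return "Purged " + TABLE_NAME;
$$;
```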
AWS & Integration-Specific Questions
Q: Describe how you would set up a data ingestion pipeline from AWS S3 into Snowflake using a stored procedure and Snowpipe.
A: The process involves creating an external stage that points to the AWS S3 bucket. You then set up a Snowpipe to automatically ingest data when new files land in S3, leveraging S3 event notifications. The stored procedure can be scheduled via a Snowflake Task Scheduler to call the COPY INTO command or related Snowpipe API to process the staged files, perhaps orchestrating a data quality check after ingestion.
Q: How do you pass parameters from an AWS service (e.g., Lambda or Step Functions) to a Snowflake stored procedure?
A: The standard method is to use one of the Snowflake connectors (e.g., Python or JDBC drivers) within the AWS service. The AWS service connects to Snowflake and executes the stored procedure using the CALL statement, passing the required parameters as arguments.
Q: How would you use Snowflake tasks and stored procedures to orchestrate a complex, dependent workflow in an AWS environment?
A: You can define a series of Snowflake tasks that execute stored procedures. Tasks can have a defined schedule and dependencies, forming a directed acyclic graph (DAG) of operations, directly within Snowflake, reducing reliance on external orchestrators like Airflow if the logic is self-contained.
Optimization & Best Practices
Q: What are some best practices for writing efficient stored procedures in Snowflake?
A:
Minimize DML operations: Use a single large INSERT or UPDATE rather than many small ones (singleton operations are inefficient in OLAP systems).
Leverage Snowflake features: Use COPY INTO, zero-copy cloning, and Time Travel features when appropriate.
Appropriate warehouse sizing: Ensure the virtual warehouse running the procedure is correctly sized for the workload to optimize performance and cost.
Monitor performance: Use the Query History to analyze query profiles and identify bottlenecks.
Modularize code: Break complex logic into smaller, reusable stored procedures or UDFs.
Q: How do you handle schema drift in stored procedures when integrating with external data sources like AWS S3?
A: You can use features like the MATCH_BY_COLUMN_NAME option in the COPY INTO statement to handle minor column order changes. For significant schema changes, the stored procedure may need to check the incoming file’s schema first (perhaps using the INFER_SCHEMA feature) and dynamically adjust the target table or use a schema evolution approach.
Q: How does the “execute as caller” versus “execute as owner” setting impact stored procedures and security in Snowflake?
A:
Execute as Caller: The procedure runs with the privileges of the user who calls it. This is useful for general utility procedures.
Execute as Owner: The procedure runs with the privileges of the role that owns the procedure, regardless of the caller’s privileges. This is vital for implementing a security model where users can perform specific actions (like data loading or updates) through the procedure without having direct, low-level access to the underlying tables, which helps prevent SQL injection and enforce access control.
MICROPARTITION CLUSTER BY KEY:
Q: What are micro-partitions, and how do they differ from traditional partitioning?
A: Micro-partitions are immutable, automatically generated storage units (50MB to 500MB uncompressed) that store data in a columnar format. Unlike traditional partitioning where you manually define partition keys and boundaries, micro-partitions are created transparently based on data insertion order, requiring no explicit administrative action from the user. Snowflake manages their creation and optimization entirely in the background.
Q: How does Snowflake use micro-partitions for query performance?
A: Snowflake stores metadata (min/max values, row counts, etc.) for each micro-partition. When a query is run, the query optimizer uses this metadata to perform “pruning,” which means it identifies and only scans the micro-partitions relevant to the query’s filter conditions, significantly reducing the amount of data read and improving performance.
Clustering Key Implementation & Strategy
Q: When is it appropriate to define a clustering key on a table?
A: You should define a clustering key for very large tables (multi-terabyte scale) where the natural loading order doesn’t align with the most frequent query filter columns. It’s particularly useful when queries filter on columns with high cardinality or a wide range of values that are not naturally ordered in the data load.
Q: How do you choose an effective clustering key?
A: Select columns that are most frequently used in WHERE clauses and JOIN conditions. The key should ideally have a good balance of cardinality to ensure data is sorted into non-overlapping micro-partitions, avoiding excessive DML costs from re-clustering. Avoid clustering on columns with very low cardinality (which doesn’t help with pruning) or extremely high cardinality (which can be expensive to maintain).
Q: How does Snowflake’s Automatic Clustering service work?
A: When a clustering key is defined, the “Automatic Clustering” service runs in the background. It asynchronously rewrites micro-partitions to physically group rows with similar values for the clustering columns into the same or adjacent micro-partitions. This process is managed by Snowflake and consumes Snowflake credits.
Monitoring & Troubleshooting
Q: How do you monitor the clustering status and performance of a table?
A: You can use the SYSTEM$CLUSTERING_DEPTH and SYSTEM$CLUSTERING_INFORMATION functions (SYSTEM$CLUSTERING_RATIO is deprecated), or query the AUTOMATIC_CLUSTERING_HISTORY view. A lower clustering depth indicates a better-clustered table. Ultimately, query performance is the best indicator.
Q: What steps would you take if a clustered table’s query performance degrades over time?
A: I would first monitor the clustering depth and the number of overlapping micro-partitions using the system functions to confirm if the table is indeed no longer well-clustered. The data ingestion patterns may have changed, causing new data to be scattered. I might then re-evaluate the clustering key to ensure it still aligns with current query patterns or allow the automatic clustering service time to recluster the new data.
Scenario-Based Questions
Q: Describe a situation where you had to optimize a slow-running query related to micro-partitions.
A: I would describe a scenario where queries filtering on a specific date column were slow. After checking the data load pattern, I found the data was ingested in a non-sequential order. By adding a clustering key on that event_date column, the automatic clustering service reorganized the data, allowing for efficient micro-partition pruning and significantly improving query performance.
Q: What are the trade-offs of using clustering keys?
A: The primary benefit is improved query performance and reduced scan times (I/O costs) for queries that filter on the clustering key. The trade-off is the cost associated with the automatic clustering service (credit consumption) and the increased DML (Data Manipulation Language) processing time as data needs to be re-clustered after inserts, updates, or deletes. You should weigh these costs against the query performance gains.
EXTERNAL TABLE AND DYNAMIC TABLE:
Core Concept Questions
Q: What is a Snowflake External Table?
A: An external table is a schema object in Snowflake that references data files located in an external stage (like AWS S3, Azure Blob, or GCP cloud storage) rather than in Snowflake’s internal storage. It allows you to query data directly in its native format without first loading it into a Snowflake-managed table.
Q: What is a Snowflake Dynamic Table?
A: A dynamic table is a type of table that simplifies the creation of data pipelines by declaratively defining the result of a query. Snowflake automatically and incrementally refreshes the table’s contents to reflect changes in the source data, enabling near-real-time data processing without manual scheduling of tasks or complex ETL procedures.
Key Differences and Comparisons
Q: Explain the fundamental difference in data storage and management between external and dynamic tables.
A: The key difference is data location and lifecycle management.
External Tables: The data resides externally in cloud storage and is managed outside of Snowflake. Snowflake only stores metadata about the files.
Dynamic Tables: The data is stored internally within Snowflake’s optimized columnar format, and its lifecycle and refresh are fully managed by Snowflake’s automated, serverless process.
Q: When would you choose an external table over a dynamic table, and vice-versa?
A:
External Tables are ideal for scenarios like:
Querying data in a data lake without ingestion to avoid data duplication or immediate storage costs.
Performing initial data exploration and validation before deciding on a full ingestion strategy.
Cases where the source data is managed by an external process and only needs occasional querying from Snowflake.
Dynamic Tables are suitable for scenarios where:
You need to build reliable, continuous data pipelines and data transformations within Snowflake itself.
You require a simple, declarative way to maintain up-to-date, transformed data for downstream analytics or reporting.
You want Snowflake to manage the refresh logic and dependencies automatically.
Q: Do external tables support DML operations (INSERT, UPDATE, DELETE)? What about dynamic tables?
A:
External Tables do not support DML operations; they are essentially read-only. Data manipulation must be done at the source files in the external stage.
Dynamic Tables also do not directly support DML operations in the same way as a standard table, as their content is defined by a query, but they automatically reflect changes (inserts, updates, deletes) from their source tables through the automated refresh process.
Operational and Scenario-Based Questions
Q: How do you handle schema drift or data freshness in an external table?
A: External tables require manual intervention or automated processes (like Snowpipe with auto-ingest) to keep their metadata in sync with the underlying files. If files are added, removed, or modified, you often need to run ALTER EXTERNAL TABLE … REFRESH to update the metadata.
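For illustration (stage and table names are hypothetical):

```sql
CREATE OR REPLACE EXTERNAL TABLE ext_events
  LOCATION = @my_s3_stage/events/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = TRUE;    -- keeps metadata in sync via cloud event notifications

-- Manual metadata sync when notifications are not configured
ALTER EXTERNAL TABLE ext_events REFRESH;
```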
Q: How does Snowflake manage the refresh process for dynamic tables?
A: Dynamic tables use a serverless compute model to automatically detect changes in the base tables and incrementally update the results. You can define a target lag (how out of date the data can be), and Snowflake handles the scheduling and compute resources needed to meet that lag, without requiring explicit tasks or warehouses to be specified for the refresh itself.
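A minimal dynamic table sketch (names are hypothetical):

```sql
CREATE OR REPLACE DYNAMIC TABLE daily_revenue
  TARGET_LAG = '5 minutes'    -- contents may trail the source by up to 5 minutes
  WAREHOUSE  = transform_wh
AS
  SELECT sale_date, SUM(amount) AS revenue
  FROM   raw_sales
  GROUP BY sale_date;
```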
Q: What are the cost considerations for using external versus dynamic tables?
A:
External Tables: Incur storage costs for the external cloud storage (managed outside Snowflake) and compute costs for queries that access the external data (which can be less performant than querying internal tables because of remote access overhead).
Dynamic Tables: Incur standard Snowflake storage costs (data is stored internally) and serverless compute costs for the automatic background refreshes. You pay for the actual data storage and the compute resources used during the refresh process
External Tables: Interview Questions & Answers
What are External Tables and when should they be used over internal tables?
External Tables are Snowflake objects that reference data stored in external cloud storage (S3, Azure Blob, GCS) without moving it into Snowflake. They are ideal for querying data directly from a data lake for ad-hoc analysis, initial data exploration, or as a source for ELT pipelines.
How can you improve query performance on External Tables?
Since External Tables do not store data in Snowflake’s optimized micro-partitions, performance is slower. You can optimize them by:
Partitioning: Use the PARTITION BY clause to prune files during scans.
Metadata Cache: Enable AUTO_REFRESH = TRUE via cloud notifications (e.g., SQS) to keep the metadata current.
File Formats: Use columnar formats like Parquet or ORC for better pruning.
Explain the difference between an External Table and a Stage.
A Stage is just a reference to a storage location used for loading/unloading files (via COPY INTO). An External Table is a schema-on-read object that allows you to run SQL queries directly against those files as if they were in a database.

Dynamic Tables: Interview Questions & Answers
What are Dynamic Tables and how do they differ from Materialized Views?
Dynamic Tables (DTs) are used for building declarative data pipelines.
Logic: DTs can contain complex queries with joins and unions, whereas Materialized Views are typically restricted to single-table projections/aggregations.
Updates: DTs refresh on a defined Target Lag (e.g., 1 minute), so their contents may trail the base tables; queries against Materialized Views always return up-to-date results because Snowflake merges pending base-table changes at query time.
What is “Target Lag” and how does it impact cost?
Target Lag is the maximum amount of time the dynamic table’s data can trail the source data. A shorter lag (e.g., 1 minute) leads to more frequent refreshes, consuming more compute credits from the dedicated Dynamic Table Service.
Can Dynamic Tables be based on other Dynamic Tables?
Yes. This is a primary use case for building multi-level data pipelines (e.g., Bronze to Silver to Gold) without external orchestration like Airflow. Snowflake manages the refresh dependency tree automatically.


Comparison & Scenario-Based Questions
Scenario: You need near-real-time data for a dashboard with complex joins. Which would you choose?
Dynamic Tables are the best choice here because they handle complex joins and automate the refresh cycle based on your desired latency (lag).
When would you combine External Tables with Dynamic Tables?
You can use an External Table as the source for a Dynamic Table. This allows you to “ingest” data from a data lake into Snowflake’s optimized storage incrementally, applying transformations as part of the DT’s refresh cycle.
VIEW AND DATA SHARE:
Q: Differentiate between a standard view and a materialized view in Snowflake. When would you choose one over the other?
A: A standard view is a stored query that is executed every time it is referenced, providing up-to-date results but potentially consuming more compute for complex queries. A materialized view (MV) is a pre-computed dataset stored for faster access, automatically maintained by Snowflake.
When to use: Use standard views for frequently changing data where real-time results are critical and query complexity is manageable. Use materialized views for repetitive, resource-intensive analytical queries on relatively stable base tables where performance is the priority and the underlying data doesn’t change frequently.
Q: Explain the purpose of a SECURE VIEW. How does it work, and what are the performance implications?
A: A SECURE VIEW prevents users from seeing the underlying query definition or the data from the base tables if they don’t have direct access, thus protecting sensitive data and intellectual property. Snowflake adds a layer of access control and obfuscation.
Performance: Secure views can be slower than standard views because Snowflake performs additional access control checks and cannot use some query optimization techniques like result caching as effectively.
Q: How do you handle schema evolution in the base tables of a view? What happens if a column is added or dropped?
A: Standard views in Snowflake are metadata-only. If a column is added to the base table, the view will generally still work and will not expose the new column unless redefined. If a column is dropped that the view explicitly uses, the view will break. Materialized views require a refresh or recreation if the underlying table structure changes significantly.
Q: Describe how you would use ZERO-COPY CLONING in conjunction with views for testing or development purposes.
A: Zero-copy cloning allows you to create instant, writable copies of databases, schemas, or tables without duplicating the storage. You can clone a production schema to a development schema, including the views. Developers can then modify or create new views on the cloned tables (which initially point to the same storage) without impacting the production environment or incurring extra storage costs.
Snowflake Data Sharing Interview Questions
Q: Explain the concept of Data Shares in Snowflake. What makes it unique compared to traditional data sharing methods?
A: A data share is a secure mechanism to share a database, schema, or specific objects with other Snowflake accounts (called consumers). The key difference is that no actual data is duplicated or transferred. The shared data remains in the provider’s storage, and the consumer accesses it using their own compute resources, ensuring a single source of truth and instant access.
Q: How do you implement secure data sharing with a partner organization using Snowflake?
A:
1. Create a Share: Use the CREATE SHARE command to define the share object.
2. Grant Access: Grant privileges on specific databases, schemas, tables, or secure views to the share.
3. Add Consumer Account(s): Add the partner’s Snowflake account(s) to the share.
4. Consumer Actions: The consumer creates a database from the share, which their users can then access via role-based access control (RBAC).
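A sketch of both sides of this flow (share, database, and account identifiers are illustrative):
sql
-- Provider side
CREATE SHARE partner_share;
GRANT USAGE ON DATABASE sales TO SHARE partner_share;
GRANT USAGE ON SCHEMA sales.public TO SHARE partner_share;
GRANT SELECT ON TABLE sales.public.orders TO SHARE partner_share;
ALTER SHARE partner_share ADD ACCOUNTS = partner_org.partner_account;

-- Consumer side (provider account identifier is illustrative)
CREATE DATABASE shared_sales FROM SHARE provider_account.partner_share;
GRANT IMPORTED PRIVILEGES ON DATABASE shared_sales TO ROLE analyst;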
Q: What is the Snowflake Data Exchange/Marketplace, and how does it relate to data sharing?
A: The Snowflake Data Marketplace is a platform where data providers can list and share data products (datasets, data services) with a broad audience of consumers. It leverages the underlying secure data sharing technology to provide a seamless, real-time data access experience. Data Exchange is a private version of the Marketplace for a specific organization’s internal data sharing.
Q: Can a consumer of a shared table perform DML operations (UPDATE, DELETE, INSERT)?
A: No, shared databases and all the objects within them are read-only for the consumer account. This design choice ensures data integrity and prevents consumers from modifying the provider’s source data.
Q: How can you implement column-level security on a shared table?
A: You can use a SECURE VIEW that projects only the non-sensitive columns, and then share the secure view instead of the base table. Alternatively, for a more robust approach, you can implement dynamic data masking policies on the sensitive columns of the base table. When the consumer accesses the data through the share, the masking policy will apply based on their role and context, masking the sensitive data dynamically.
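The dynamic masking approach might be sketched as follows (policy, table, and role names are illustrative); because the policy lives on the provider's table, it also applies when consumers query the data through a share:
sql
-- Mask a sensitive column except for an authorized role
CREATE OR REPLACE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('DATA_ADMIN') THEN val ELSE '***MASKED***' END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY mask_email;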
Snowflake Views: Interview Questions & Answers
What is the difference between a Standard View and a Materialized View?
A Standard View is a virtual table that executes its underlying query dynamically every time it is called. A Materialized View is a precomputed data set stored by Snowflake for faster retrieval, which is especially useful for repetitive, complex aggregations. While standard views are free to maintain, materialized views incur compute costs for automatic maintenance as base table data changes.
What is a Secure View, and when would you use it?
A Secure View is designed for data privacy; it prevents users from seeing the underlying query definition or using certain optimization techniques that might expose data. You use it to implement column-level security or row-level access control, ensuring that sensitive data is only visible to authorized roles.
Scenario: You need to improve the performance of a view that aggregates millions of rows. What steps do you take?
First, evaluate if it can be a Materialized View. If not, ensure the base tables are properly clustered on the columns used in filters or joins. Check the query profile to identify bottlenecks like “exploding joins” or unnecessary subqueries that could be consolidated.
Can you create a view on top of a Shared Table?
Yes, but if you want to share that view back out, it must be a Secure View.

Data Sharing: Interview Questions & Answers
How does Snowflake’s Data Sharing differ from traditional data movement?
Traditional sharing requires physically moving or copying data (ETL/FTP). Snowflake Data Sharing is metadata-driven; no actual data is copied or duplicated. The “Provider” account grants access to its live data to “Consumer” accounts, who see the changes in real-time.
Who pays for the costs in a Data Share?
The Provider is charged for the storage of the shared data. The Consumer is charged for the compute (virtual warehouse) used to query that shared data.
What is a Reader Account, and when is it necessary?
A Reader Account is a Snowflake account created by a Provider for consumers who do not have their own Snowflake account. The Provider pays for both storage and the compute used by these accounts.
Can you share data across different cloud regions or platforms?
Standard Data Sharing works within the same region. To share across regions (e.g., AWS US-East to Azure West-Europe), you must first replicate the database to a linked account in the target region.
Scenario: A partner organization needs to see only a specific subset of your data. How do you implement this?
I would create a Secure View that filters the data based on the partner’s requirements, then add only that specific view to the Data Share rather than sharing the entire base table.
ROW ACCESS POLICY:
Q: What is a Snowflake Row Access Policy?
A: A Row Access Policy is a schema-level security object that defines row-level access control to a table or view. When a policy is attached to a table, Snowflake evaluates the policy’s SQL expression for each row in the table at query time to determine which rows should be returned in the query result based on the querying user’s role or other conditions.
Q: How do Row Access Policies fit into Snowflake’s overall security framework?
A: They are a core part of Snowflake’s data governance capabilities, working alongside role-based access control (RBAC), data masking policies, and object tagging to ensure data security and compliance.
Q: What is the difference between Row Access Policies and secured views?
A: Row Access Policies are a more streamlined, centralized approach to row-level security that can be managed on the table itself, rather than creating and managing separate secured views for each access scenario, which can become complex.
Implementation and Usage
Q: How do you create and apply a Row Access Policy in Snowflake?
A: A policy is created using CREATE ROW ACCESS POLICY, which includes a SQL expression that typically uses a mapping table or a custom function to determine access. It is then applied to a table using the ALTER TABLE … ADD ROW ACCESS POLICY command.
Q: Can a single table have multiple Row Access Policies?
A: No, a single table or view can only have one Row Access Policy attached to it. However, the logic within that single policy can be complex and reference multiple conditions or mapping tables.
Q: What happens if a user queries a table protected by a policy but doesn’t have the necessary role or permissions?
A: If the policy determines the user does not have access to specific rows, those rows are simply filtered out from the query result, and the user will not know those rows exist. The user will only see the data they are authorized to view.
Q: How can you use system functions within a Row Access Policy?
A: You can use system functions like CURRENT_ROLE() or CURRENT_USER() to dynamically filter rows based on the identity or role of the user executing the query. This enables dynamic data security policies.
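Pulling the points above together, a minimal sketch of creating and applying a policy that uses CURRENT_ROLE() with a mapping table (policy, table, and role names are illustrative):
sql
-- Rows are visible to an admin role, or to roles mapped to the row's region
CREATE OR REPLACE ROW ACCESS POLICY region_policy AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'GLOBAL_ADMIN'
  OR EXISTS (
    SELECT 1 FROM security.region_map m
    WHERE m.role_name = CURRENT_ROLE() AND m.region = region
  );

ALTER TABLE sales ADD ROW ACCESS POLICY region_policy ON (region);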
Limitations and Best Practices
Q: What are some limitations of Snowflake Row Access Policies?
A: Limitations include:
The CHANGES clause on a view protected by a policy is not supported.
External tables cannot be used as mapping tables in a policy.
You cannot apply a policy directly to a stream object, but it applies to the underlying table the stream accesses.
Q: How does a Row Access Policy impact query performance?
A: The policy adds a filtering step to query execution. Performance can be optimized by ensuring the conditions in the policy leverage a well-clustered column or a highly selective filter, which helps with Snowflake’s micro-partition pruning.
Q: What are the best practices for implementing Row Access Policies?
A: Key best practices include:
Centralize policy management using mapping tables rather than hardcoding logic.
Limit the complexity of the SQL expression within the policy function to ensure optimal performance.
Thoroughly test policies with different user roles to ensure correct data visibility and prevent unintended data exposure.
Grant the APPLY ROW ACCESS POLICY privilege to custom roles to manage policy application effectively.
JSON AND FILEFORMAT:
Q: How does Snowflake support and handle semi-structured data like JSON?
A: Snowflake has native support for semi-structured data using the VARIANT, ARRAY, and OBJECT data types. It stores this data in an optimized, compressed, columnar format (micro-partitions), similar to structured data, allowing for efficient storage and querying without a fixed schema at load time. This approach differs from other data warehouses like Redshift, which might split JSON into strings, making it harder to query.
Q: How do you load JSON data into a Snowflake table?
A: Data is first staged in an internal or external stage (e.g., AWS S3, Azure Blob Storage). The COPY INTO command is then used to load the data into a table with a VARIANT column. You specify the file format options in the COPY INTO statement. For continuous, automated ingestion, Snowpipe can be used with event notifications.
Q: How do you query specific elements within a JSON VARIANT column?
A: You can query elements using standard SQL syntax with a colon (:) or dot (.) notation for path traversal, along with functions such as GET and GET_PATH, and type casting. For example, SELECT data:user.name::string FROM my_table would extract the ‘name’ field as a string.
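Expanding the example (the column and field names are illustrative):
sql
-- data is a VARIANT column holding e.g. {"user": {"name": "Asha", "tags": ["a", "b"]}}
SELECT
  data:user.name::STRING      AS user_name,    -- colon + dot traversal with a cast
  data:user.tags[0]::STRING   AS first_tag,    -- bracket notation for array elements
  GET_PATH(data, 'user.name') AS name_variant  -- function form, returns a VARIANT
FROM my_table;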
Q: What is the FLATTEN function, and when would you use it?
A: The FLATTEN function is a table function that explodes semi-structured data (VARIANT, OBJECT, or ARRAY) into a relational representation. It’s useful when you need to flatten nested JSON structures to perform joins or aggregate data more efficiently, treating array elements or object key-value pairs as individual rows.
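A sketch of FLATTEN used as a lateral table function (column and field names are illustrative):
sql
-- Explode each element of a nested tags array into its own row
SELECT t.data:user.name::STRING AS user_name,
       f.value::STRING          AS tag
FROM my_table t,
     LATERAL FLATTEN(input => t.data:user.tags) f;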
Q: What are the performance implications of storing and querying JSON data in Snowflake?
A: While Snowflake handles JSON natively and efficiently due to its columnar storage, heavily querying deeply nested data or a very large VARIANT column can have performance implications compared to querying well-structured, normalized relational data. For frequently accessed nested elements, a common practice is to extract them into dedicated relational columns using views or during the load process for better performance and micro-partition pruning.
File Formats in Snowflake
Q: What file formats are natively supported by Snowflake for data loading?
A: Snowflake supports a variety of file formats, including CSV, JSON, Avro, Parquet, ORC, and XML.
Q: How do you define a specific file format for an ingestion pipeline?
A: You can define a file format using the CREATE FILE FORMAT DDL statement, specifying options like TYPE (e.g., JSON), COMPRESSION, FIELD_DELIMITER, SKIP_HEADER, etc. This named file format object can then be referenced in the COPY INTO statement or a stage definition to ensure consistent loading parameters.
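For instance (format, stage, and table names are illustrative; the target table is assumed to have a single VARIANT column):
sql
CREATE OR REPLACE FILE FORMAT my_json_format
  TYPE = 'JSON'
  COMPRESSION = 'GZIP'
  STRIP_OUTER_ARRAY = TRUE;

COPY INTO raw.events
FROM @my_stage/events/
FILE_FORMAT = (FORMAT_NAME = 'my_json_format');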
Q: Explain the difference between an internal and an external stage.
A:
Internal stages are secure storage locations managed entirely within Snowflake, ideal for staging sensitive data or data that does not need to be accessed by external tools.
External stages reference data files stored in external cloud storage (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage). This is common for existing data lakes and large-scale, automated ingestion pipelines using Snowpipe event notifications.
Q: How does Snowflake’s handling of file formats impact performance or cost?
A: Snowflake automatically compresses data, so users are charged based on the final compressed size, which helps reduce storage costs. Using efficient binary formats like Parquet (which is columnar and compressed) can optimize load times and query performance due to features like predicate pushdown and micro-partition pruning. Defining file formats correctly is crucial to avoid loading errors and ensure efficient data parsing.

DBT
MODELS:
Q1: How do you handle high-volume data, and which materialization is best in Snowflake?
Answer: I prefer incremental materialization using Snowflake’s MERGE strategy to update only new/changed records, significantly reducing compute costs compared to full table rebuilds.
Q2: Explain how to optimize dbt models for Snowflake performance.
Answer: I use cluster_by in the config to define clustering keys on frequently filtered columns (e.g., dates) to improve query performance via micro-partition pruning.
Q3: How do you handle slowly changing dimensions (SCDs) in dbt?
Answer: I use dbt snapshot for Type 2 SCDs, as it automatically tracks changes over time, or I create an incremental model with merge logic if custom logic is required.
Q4: How do you implement Incremental Models and avoid duplicates?
Answer: I define a unique key in the configuration and use a where clause to filter new records (e.g., where date > (select max(date) from {{ this }})).
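A minimal incremental model sketch (source and column names are illustrative):
sql
{{ config(materialized='incremental', unique_key='order_id', incremental_strategy='merge') }}

select order_id, customer_id, order_date, amount
from {{ source('raw', 'orders') }}
{% if is_incremental() %}
  -- only pick up rows newer than what is already in the target table
  where order_date > (select max(order_date) from {{ this }})
{% endif %}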
Q5: What are the best practices for using ref() versus source()?
Answer: source() refers to raw data loaded by ingestion tools, allowing for data freshness checks. ref() refers to other dbt models, creating dependencies and enabling lineage.
Jinja & Dynamic SQL
Q6: Describe a scenario where you used Jinja macros in dbt.
Answer: I created a macro to dynamically generate CASE statements for complex calculations across multiple columns, avoiding code repetition, or to handle data type conversion across different source systems.
Q7: What is an ephemeral materialization and when would you use it?
Answer: Ephemeral models are not created in the database; they are CTEs injected into dependent models. I use them for lightweight transformations that are only used once, reducing clutter in the schema.
Snowflake & dbt Configuration
Q8: How do you manage different environments (Dev, QA, Prod)?
Answer: I use profiles.yml to define target databases/schemas based on the target environment (--target prod or --target dev) and use environment variables for sensitive credentials.
Q9: What is dbt seed and how do you use it?
Answer: dbt seed loads CSV files (mapping tables, lookup data) from the seeds directory into Snowflake. It is ideal for small, static datasets.
Q10: How do you ensure high-quality data in your models?
Answer: I use generic tests (unique, not_null, relationships) and custom tests in schema.yml, plus I use the dbt-expectations package for advanced validations.

CI/CD & Optimization
Q11: How do you implement CI/CD for dbt?
Answer: In GitHub Actions, I run dbt build --select state:modified+ (with --state pointing at the previous run’s artifacts) to only run modified models and their downstream dependencies, ensuring fast CI/CD cycles.
Q12: How do you handle schema evolution (new columns) in dbt models?
Answer: I use on_schema_change='sync_all_columns' in the incremental model configuration to automatically add new columns to the target table without breaking the pipeline.
SNAPSHOT:
1. What is a dbt Snapshot and why is it used in Snowflake?
Answer: A snapshot in dbt is a mechanism to implement Slowly Changing Dimension Type 2 (SCD2) functionality. In Snowflake, it is used to track changes in a source table over time, enabling historical reporting, such as knowing what a customer’s address was six months ago, even if it has changed since.
2. How do you implement a snapshot in dbt?
Answer: Snapshots are SQL files placed in the snapshots directory. They must contain a snapshot configuration block defining the target database/schema, a unique key, and the strategy (timestamp or check).
sql
{% snapshot orders_snapshot %}
{{ config(target_schema='snapshots', unique_key='id', strategy='timestamp', updated_at='updated_at') }}
select * from {{ source('raw', 'orders') }}
{% endsnapshot %}

3. Explain timestamp vs. check strategy in snapshots.
Answer:
Timestamp: Used when the source table has a column that updates whenever a record changes (e.g., updated_at, last_modified).
Check: Used when there is no reliable timestamp, but you want to check specific columns for changes. dbt compares the current value with the last snapshot value.

4. How does a Snapshot handle updates in Snowflake?
Answer: When a row changes, the snapshot invalidates the old record by setting the dbt_valid_to column to the current timestamp and inserts a new record with the updated data and dbt_valid_from set to the current time. The dbt_valid_to of the current record is NULL.

5. How do you handle NULL values in the updated_at column during a timestamp strategy?
Answer: If updated_at can be NULL, the snapshot might fail or not update correctly. A best practice is to use a COALESCE function in the strategy configuration to default NULL to a very old date or to use a surrogate created_at timestamp to ensure the row is captured.

6. What are the key configuration columns generated by dbt snapshots?
Answer:
dbt_scd_id: A unique surrogate key for the snapshot record.
dbt_valid_from: When this version of the record became valid.
dbt_valid_to: When this version of the record became invalid.
dbt_updated_at: The timestamp the record was updated in the source.

7. How would you optimize a snapshot that is running slowly?
Answer:
Incremental Snapshots: Ensure the source data is pre-filtered if possible.
Clustering Keys: Add clustering keys to the snapshot table in Snowflake to speed up dbt_valid_to updates.
Warehouse Size: Increase the Snowflake warehouse size during the snapshot run.
Check Strategy Optimization: Reduce the number of columns being checked in a check strategy.

8. How do you use a snapshot in a downstream dbt model?
Answer: Use the {{ ref('snapshot_name') }} function to reference the snapshot in your transformation models, just like any other dbt model or source.
TEST:
Core Concepts & Architecture
Explain the Snowflake architecture. It’s a hybrid shared-disk/shared-nothing architecture with three layers: Cloud Services, Query Processing (Virtual Warehouses), and Database Storage. This separation of compute and storage is a key differentiator.
What are Virtual Warehouses, and how do you size/scale them? Virtual warehouses are independent compute clusters. Sizing (XS, S, M, etc.) impacts performance and cost. You can use multi-cluster warehouses and auto-scaling policies to handle concurrency and different workloads efficiently.
Explain Snowflake’s caching mechanism. There are three main types: Result Cache (stores query results for 24 hours), Local Disk Cache (caches data on the warehouse nodes for active queries), and Remote Disk Cache (long-term data caching in the storage layer).

Data Loading and Transformation
Describe the process of loading data into Snowflake. Typically involves staging data (internal or external stages like AWS S3), defining file formats, and using the COPY INTO command. For continuous ingestion, Snowpipe is used.
What is Snowpipe, and when would you use it? Snowpipe is a continuous data ingestion service that loads data in micro-batches using event notifications (e.g., S3 events) for near real-time analytics.
How does Snowflake handle semi-structured data (e.g., JSON)? Snowflake natively supports semi-structured data using the VARIANT data type and provides functions to query it directly without a predefined schema.

Advanced Features & Optimization
Explain Time Travel and Fail-safe.
Time Travel allows accessing historical data within a defined retention period (default 1 day, up to 90 days for Enterprise Edition) for querying, cloning, or restoring dropped objects.
Fail-safe is a 7-day data recovery period that starts after the Time Travel retention ends, used only for disaster recovery by Snowflake support.
What is Zero-copy cloning? It allows creating a duplicate of a database, schema, or table without physically copying the underlying storage, using metadata pointers. This is useful for development, testing, and creating historical snapshots.
How do you optimize a slow-running query in Snowflake? Check the query profile for bottlenecks. Optimization techniques include proper warehouse sizing, reducing unnecessary joins, using clustering keys on large tables, and leveraging materialized views for repeated aggregations.

dbt (Data Build Tool) Interview Questions
Questions will focus on project structure, testing, deployment, and best practices within the dbt ecosystem.
Core Concepts
What is dbt, and how does it differ from traditional ETL? dbt (Data Build Tool) focuses on the “T” in ELT, transforming and modeling data within the data warehouse using SQL (or Python with Snowpark). Unlike traditional ETL tools, it leverages the power of the data warehouse and integrates with version control (Git).
Explain the difference between ref() and source().
ref() is used to reference other dbt models within your project, automatically managing dependencies and build order.
source() is used to reference raw, untransformed data tables that are already loaded into your warehouse (defined in a sources.yml file).
What are dbt models and materializations? A dbt model is a select statement defined in a .sql file. Materialization types define how the model is built in the warehouse:
view (default): a virtual table created at runtime, useful for simple transformations.
table: a physical table created in the database, suitable for intermediate or final datasets.
incremental: updates an existing table with only new or changed data, ideal for large datasets and frequent updates.
ephemeral: used as a CTE in downstream models, not created as a persistent database object.

Testing and Deployment
How do you test your dbt models? You use tests defined in .yml files to ensure data quality and integrity. dbt provides generic tests like unique, not_null, accepted_values, and relationships. You can also write custom SQL tests.
Explain how dbt manages dependencies and execution order. dbt uses a Directed Acyclic Graph (DAG) to understand the relationships between models and sources. It determines the correct sequence to build models, ensuring dependencies are built first.
How do you deploy dbt models to production? Deployment often involves using dbt Cloud or setting up a CI/CD pipeline (e.g., using Jenkins, Airflow) that compiles and runs the models in the production environment.

Scenario-Based Questions
Describe a challenging dbt project you worked on. Discuss issues such as optimizing a slow-running incremental model, handling schema changes in production, or implementing complex data quality checks, and how you solved them using dbt features and best practices.
How do you handle schema evolution in an incremental model? Dbt’s incremental materialization has strategies like merge to manage column additions. For more complex changes (e.g., column type changes), a full-refresh run or a more advanced strategy might be needed.
MACRO HOOKS:
1. dbt Macros (Reusable Code)
Q1: What are dbt macros, and why would you use them in a 4-year experience scenario?
Answer: Macros are Jinja-based SQL functions that allow reusability of code, similar to functions in traditional programming. At 4 years experience, they are crucial for:
DRY (Don’t Repeat Yourself): Standardizing complex logic like surrogate key generation, SCD type-2 logic, or specific Snowflake data masking.
Custom Materializations: Defining how a table is built.
Dynamic SQL: Generating SQL based on environmental variables or configuration.

Q2: How do you pass arguments to a macro and call it in a model?
Answer: Define the macro in the macros/ folder using {% macro macro_name(arg1, arg2) %} … {% endmacro %}. Call it within a model using {{ macro_name(val1, val2) }}.
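For example, a small macro with a default argument (the macro and column names are illustrative):
sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- in a model:
-- select {{ cents_to_dollars('amount_cents') }} as amount_usd
-- from {{ ref('stg_payments') }}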
Q3: Explain the difference between {{ ref() }} and {{ source() }} within a macro.
Answer:
ref(): Used for building dependency graphs between dbt models within the project. It ensures proper order of execution.
source(): Used to reference raw data loaded by external tools (like Fivetran/Airbyte) defined in sources.yml. It handles raw table dependencies.

Q4: Scenario: You need to generate a surrogate key based on different columns in 50 different tables. How would you solve this?
Answer: I would create a custom macro in macros/generate_surrogate_key.sql that takes a list of columns, casts them to string, concatenates them, and applies a hashing function (e.g., md5).
sql
-- Example call
{{ dbt_utils.generate_surrogate_key(['col1', 'col2']) }}

2. dbt Hooks (Pre/Post-hooks)
Q5: What are hooks in dbt, and when do they run?
Answer: Hooks are SQL statements executed before (pre-hook) or after (post-hook) a model, seed, or snapshot runs.
Pre-hook: Useful for cleaning up, creating audit tables, or setting session-level configurations.
Post-hook: Ideal for granting permissions, creating indexes (in other DBs), setting clustering keys in Snowflake, or updating metadata.

Q6: Scenario: After a table is rebuilt in Snowflake, you need to grant SELECT access to the ANALYST_ROLE. How do you automate this?
Answer: I would use a post-hook in the model configuration.
sql
{{ config(
    materialized='table',
    post_hook=["GRANT SELECT ON {{ this }} TO ROLE ANALYST_ROLE"]
) }}
Alternative: I would create a generic grant_select macro and call it in dbt_project.yml for all models in a specific directory.
Q7: How can you use a hook to set a query tag in Snowflake for audit purposes?
Answer: By using a pre-hook to set the session variable:
sql
{{ config(
    pre_hook=["ALTER SESSION SET QUERY_TAG = 'dbt_model_run'"]
) }}

Q8: Can you execute a stored procedure using a dbt hook?
Answer: Yes, by using CALL my_procedure(); within a pre-hook or post-hook.

3. Advanced Scenario-Based Questions (4+ Years Exp)
Q9: Explain the use of {{ this }} in dbt hooks.
Answer: {{ this }} is a Jinja variable that refers to the current model’s database object (database, schema, table/view name). It is essential in post-hooks for granting privileges or applying clustering keys on the table that was just built.

Q10: What is the impact of using hooks on DAG performance?
Answer: Hooks run sequentially with the model, which can slow down the overall dbt run time if they are complex. It is best to keep hooks lightweight.
Q11: Scenario: You have a large table that takes too long to run. You want to implement a custom, complex incremental update using a macro + hook. What’s your approach?
Answer:
Create a macro that takes table_name and unique_key.
Use if is_incremental() inside the model to identify new rows based on max(updated_at).
Use a post-hook to run a DELETE on the table to remove rows existing in the new data, then an INSERT (SCD Type 1 behavior).

Q12: How would you structure your dbt_project.yml to apply a post-hook to all models in the marts/ directory?
Answer:
yaml
models:
  my_project:
    marts:
      +post-hook: "GRANT SELECT ON {{ this }} TO ROLE REPORTER"

Summary of Key Concepts
Macros = Reusable Jinja/SQL code.
Hooks = Before/After actions (pre-hook, post-hook).
{{ this }} = Current model.
{{ ref() }} = Upstream dependency.
{{ source() }} = Raw data dependency.
SEED EXPOSURE:
I. DBT Seeds (Data Seeding)
Advanced Focus: Handling large reference data, automation, and schema evolution.
Q1: How do you handle schema changes in seed files?
Answer: dbt seed does not automatically update column types. If a column type changes in the CSV, I run dbt seed --full-refresh to recreate the table with the new schema.
Q2: When would you use a seed vs. a table model?
Answer: I use seeds for small, static mapping files (e.g., country_codes.csv, category_mapping.csv) that rarely change and are maintained by analysts in CSV format. I use model tables for data that requires heavy transformation, has high volume, or changes frequently.

Q3: How do you handle a scenario where a seed file has >100,000 rows, making dbt seed slow?
Answer: For large reference data, I avoid dbt seed. Instead, I load the CSV into a raw table using a tool like Fivetran/Snowpipe, and then create a dbt model on top of it. This allows for incremental loading and better performance, as dbt seed truncates and reloads the entire file.

II. DBT Exposures (Data Lineage & Downstream Usage)
Advanced Focus: Lineage, impact analysis, and documentation.
Q4: What is a dbt exposure and why is it crucial for a 4-year experienced developer?
Answer: Exposures define downstream consumers (Dashboards, ML models, Data science notebooks) in yaml files. It bridges the gap between dbt models and BI tools (Looker/Tableau).
Crucial because: It enables impact analysis (knowing what breaks if I change a model) using dbt run -s +exposure:name and improves documentation.
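A sketch of an exposure definition (all names, the URL, and the owner details are illustrative placeholders):
yaml
exposures:
  - name: revenue_dashboard
    type: dashboard
    maturity: high
    url: https://tableau.example.com/revenue   # illustrative
    owner:
      name: Analytics Team
      email: analytics@example.com
    depends_on:
      - ref('fct_revenue')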

Q5: How do you automate the updating of exposures?
Answer: Instead of manually defining them, I use CI/CD scripts or tools like AirOps to scan the BI tool API and generate the exposure YAML dynamically, ensuring they are always up-to-date.

Q6: Scenario: A Tableau dashboard is breaking, and the data owner blames the data team. How do you prove the data is correct?
Answer: I use the lineage graph produced by dbt docs generate to show the full downstream dependency. I then use dbt test to show that all tests on the upstream models passed at the time of the last run, isolating the issue to either the BI report logic or the data warehouse layer.

III. DBT Hooks (Automation)
Advanced Focus: Grant management, audit logging, and performance tuning.
Q7: Explain the difference between pre-hooks and post-hooks, and provide a production use case for each.
Answer:
Pre-hook: Executes SQL before a model runs. Use case: ALTER SESSION SET TIMEZONE='UTC' or deleting old data in an incremental model.
Post-hook: Executes SQL after a model runs. Use case: Granting read permissions to a reporting role (GRANT SELECT ON {{ this }} TO ROLE REPORTING_USER) or logging execution metadata.
Q8: How do you implement automated auditing for dbt model execution?
Answer: I use a post-hook macro that inserts a record into an audit table: INSERT INTO audit_log (model_name, execution_time, status) VALUES ('{{ this }}', CURRENT_TIMESTAMP, 'SUCCESS').
Q9: Can you define a hook that applies to the entire project?
Answer: Yes, in dbt_project.yml under models:, I define post-hook to grant access to all materialized models.
IV. Snowflake & DBT Integration (4 Years Exp)
Advanced Focus: Performance, cost management, and advanced features.
Q10: How do you optimize a slow-performing dbt model that joins multiple large tables in Snowflake?
Answer:
Materialization: Changed from view to table or incremental.
Clustering: Define cluster_by in the config to optimize on columns frequently used in WHERE/JOIN clauses (e.g., event_date).
CTE Refactoring: Break down complex, nested CTEs into smaller intermediate models (int_ layer) to allow Snowflake to optimize better.
Q11: How do you manage warehouse costs for a heavy dbt production run?
Answer: I assign specific, transient virtual warehouses for dbt jobs, typically a Large or X-Large for the run, and set auto_suspend to a low value (e.g., 60 seconds) to ensure it shuts down immediately after the dbt run completes.
Q12: What is “Zero-Copy Cloning” and how have you used it in a dbt project?
Answer: Zero-copy cloning creates a snapshot of a database/table without copying physical data. Usage: I clone the production database to a dev database (CREATE DATABASE dev_db CLONE prod_db) to test new dbt models, ensuring I am testing on real data without affecting production.
Q13: Scenario: You need to implement Slowly Changing Dimension (SCD) Type 2 in Snowflake using dbt. How?
Answer: I use dbt snapshots. I define a unique key and an updated-at timestamp in the snapshot config. dbt handles the logic of invalidating old records and inserting new ones.
