Unlocking Embedded Visuals from Documents Using Snowflake Cortex
Last Updated on February 19, 2026 by Editorial Team

Author(s): Krishnan Srinivasan

Originally published on Towards AI.

Most document AI discussions focus on text extraction, OCR accuracy, table detection, and layout parsing. These are familiar themes. But many enterprise documents are not just text-centric. Inspection reports, audit documents, supplier catalogs, safety manuals, and compliance records often rely on embedded images as the primary source of evidence. Treating those documents as text-first artifacts leaves an important gap.

Snowflake Cortex addresses this gap by making image extraction a native capability of document parsing itself. With AI_PARSE_DOCUMENT, images are not handled as a secondary concern or an external post-processing step. Instead, they are surfaced as structured outputs alongside text and layout, using nothing more than SQL. Support for extracting embedded images in AI_PARSE_DOCUMENT was announced in preview last week.

This blog walks through a complete, end-to-end image extraction flow using a real PDF with embedded images and explains each step in detail. The goal here is simple: take a PDF that contains embedded images, extract those images using AI_PARSE_DOCUMENT, and persist them as governed assets inside Snowflake for downstream use.

We will use a warehouse inspection report to demonstrate this. In such a report, images are not decorative additions but primary evidence. Here's a snapshot from the report (the PDF is available with the code).

The warehouse floor image visually substantiates claims about clear aisles, organized racking, and safe equipment placement, while the electrical panel image supports compliance statements around labeling, hazard warnings, and unobstructed access. Extracting and storing these images independently allows them to be reused as auditable artifacts during safety reviews, compliance checks, or follow-up inspections, without repeatedly reprocessing the full PDF.
This shifts inspection images from static report attachments into governed data assets that strengthen traceability and trust across operational workflows. (For context, take a look at the report before proceeding.)

Let us now proceed with the implementation.

Step 1: Create a workspace and staging areas

We begin by setting up a small, isolated workspace for the demo. Two stages are created: one to store the input PDF and another to store the extracted images. Directory-enabled stages allow us to query file metadata directly using SQL, which becomes useful later. At this point, Snowflake is ready to treat documents as first-class inputs to Cortex AI.

Step 2: Upload the inspection report PDF

The demo uses a simple file named Inspection_Report.pdf. The PDF documents a routine warehouse inspection, summarizing floor conditions, storage and equipment safety, electrical panel compliance, and overall operational readiness, reflecting a real-life operational inspection scenario. It contains descriptive text and a couple of embedded images (a warehouse floor and an electrical panel, as shown above), making it ideal for demonstrating extraction. Upload the file into the input stage that we created (INPUT_PDFS).

Step 3: Verify that the file is persistent and accessible

This check avoids most downstream errors related to file paths or stage configuration.

    SELECT relative_path, size, last_modified
    FROM DIRECTORY(@IMAGE_EXTRACT_DEMO.DOCS.INPUT_PDFS)
    ORDER BY last_modified DESC;

Step 4: Parse the document with image extraction enabled

This is the core operation. Image extraction in Snowflake requires layout-aware parsing, so the document is processed using mode = 'LAYOUT' with extract_images explicitly enabled.

    SELECT AI_PARSE_DOCUMENT(
        TO_FILE('@IMAGE_EXTRACT_DEMO.DOCS.INPUT_PDFS', 'Inspection_Report.pdf'),
        { 'mode': 'LAYOUT', 'extract_images': TRUE }
    ) AS parsed_document;

The result is a structured JSON object representing the document.
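To make that structure a little more concrete, the sketch below walks the images array of a hand-written stand-in for the parsed result in Python. It is an illustration only, not real AI_PARSE_DOCUMENT output: the keys images, id, and image_base64 are the ones the stored procedure later in this article reads, while the ids, the tiny payload, and the helper name describe_images are made up for the example.

```python
import base64

# Hypothetical stand-in for the parsed result. A real result also carries
# text and layout; the payload here is four arbitrary bytes, not an image.
_payload = base64.b64encode(b"demo").decode("ascii")
parsed_document = {
    "images": [
        {"id": "img-0", "image_base64": "data:image/png;base64," + _payload},
        {"id": "img-1", "image_base64": _payload},
    ]
}


def describe_images(parsed):
    """Return one (id, payload_length, payload_prefix) tuple per image entry."""
    rows = []
    for index, image in enumerate(parsed.get("images") or []):
        payload = image.get("image_base64", "")
        image_id = image.get("id", f"image_{index:03d}")
        rows.append((image_id, len(payload), payload[:22]))
    return rows


for row in describe_images(parsed_document):
    print(row)
```

Each tuple gives the image identifier, the length of its Base64 string, and the first characters of the payload, which is enough to see at a glance whether an entry is populated.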
Text blocks, layout elements, and images are all included. For this blog, the focus is on the images array inside the result, which contains each extracted image as a structured object.

Step 5: Validate that images were located

Before doing anything else, it is important to confirm that images were actually found. This query simply counts the number of extracted images.

    WITH doc AS (
        SELECT AI_PARSE_DOCUMENT(
            TO_FILE('@IMAGE_EXTRACT_DEMO.DOCS.INPUT_PDFS', 'Inspection_Report.pdf'),
            { 'mode': 'LAYOUT', 'extract_images': TRUE }
        ) AS parsed_document
    )
    SELECT ARRAY_SIZE(parsed_document:images) AS image_count
    FROM doc;

You should see IMAGE_COUNT return 2, confirming that the document contains two extractable embedded images and that the parsing configuration is correct.

Step 6: Inspect image structure and metadata

At this stage, the goal is simply to confirm that embedded images have been successfully extracted from the document. For image extraction workflows, the most reliable signal is the presence of a non-empty Base64 payload for each image entry returned by AI_PARSE_DOCUMENT. A non-zero base64_length confirms that the image bytes were successfully extracted. The prefix helps visually validate that the payload is a valid Base64-encoded image rather than an empty or malformed value.

Step 7: Persist extracted images as stage files

In many workflows, extracted images need to be stored independently so they can be reused, audited, or consumed by other pipelines. The procedure below decodes each Base64 payload and writes the image as a file into a destination stage.

The procedure acts as a simple materialization step between document parsing and downstream consumption. It takes the structured JSON output produced by AI_PARSE_DOCUMENT, iterates over the extracted images array, and decodes each Base64-encoded image payload into its original binary form. Each decoded image is then written as an individual file into a Snowflake stage, with a filename derived from its image ID.
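The iterate-and-decode step just described can be tried outside Snowflake. The following is a minimal local sketch, not the procedure itself: it applies the same data-URI prefix stripping and Base64 decoding that the procedure's handler uses, but returns the bytes in a dictionary instead of writing to a stage. The function name decode_images and the sample entries are assumptions made for illustration.

```python
import base64
import re

# Matches an optional data-URI header such as "data:image/png;base64,".
DATA_URI_PREFIX = re.compile(r"^data:image/[^;]+;base64,")


def decode_images(parsed):
    """Map each image id to its decoded bytes, skipping empty payloads."""
    decoded = {}
    for index, image in enumerate(parsed.get("images") or []):
        image_id = image.get("id", f"image_{index:03d}")
        payload = DATA_URI_PREFIX.sub("", image.get("image_base64", "")).strip()
        if not payload:
            continue  # nothing to decode for this entry
        decoded[image_id] = base64.b64decode(payload)
    return decoded


# Made-up sample: one prefixed payload, one empty entry, one entry without an id.
sample = {
    "images": [
        {"id": "a", "image_base64": "data:image/png;base64,"
                                    + base64.b64encode(b"\x89PNG").decode("ascii")},
        {"id": "b", "image_base64": ""},
        {"image_base64": base64.b64encode(b"raw").decode("ascii")},
    ]
}
result = decode_images(sample)
print(sorted(result))
```

Entries with an empty payload are skipped rather than raising an error, and an entry without an id falls back to a positional name, so one bad entry does not stop the rest from being processed.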
This converts images from in-memory JSON fields into durable, governed assets that can be listed, referenced, and reused independently of the original PDF. By keeping this logic inside Snowflake, the procedure ensures that image extraction, persistence, and access control remain part of the same secure data workflow rather than being split across external tools.

    CREATE OR REPLACE PROCEDURE IMAGE_EXTRACT_DEMO.DOCS.SAVE_EXTRACTED_IMAGES(
        PARSED_RESULT VARIANT,
        DEST_STAGE_PATH STRING
    )
    RETURNS ARRAY
    LANGUAGE PYTHON
    RUNTIME_VERSION = '3.10'
    PACKAGES = ('snowflake-snowpark-python')
    HANDLER = 'run'
    AS
    $$
    import base64
    import os
    import re
    import tempfile

    def run(session, parsed_result, dest_stage_path):
        # The destination must be a stage reference such as @DB.SCHEMA.STAGE
        if not dest_stage_path or not dest_stage_path.startswith("@"):
            return ["Destination stage path must start with @"]

        images = parsed_result.get("images") or []
        if len(images) == 0:
            return ["No images found in parsed document"]

        written_files = []
        with tempfile.TemporaryDirectory() as temp_dir:
            for index, image in enumerate(images):
                image_id = image.get("id", f"image_{index:03d}")
                base64_payload = image.get("image_base64", "")
                # Drop a leading data-URI header if one is present
                base64_payload = re.sub(
                    r"^data:image/[^;]+;base64,", "", base64_payload
                ).strip()
                if not base64_payload:
                    continue

                binary_image = base64.b64decode(base64_payload)
                file_name = f"{image_id}.bin"
                local_path = os.path.join(temp_dir, file_name)
                with open(local_path, "wb") as file:
                    file.write(binary_image)

                session.file.put(
                    local_path, […]
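The handler above names every file with a .bin extension. If a recognizable extension is wanted, one option (not part of the original procedure) is to look at the first bytes of each decoded image and map well-known file signatures to an extension. The helper guess_extension below is hypothetical and covers only PNG, JPEG, and GIF; anything else falls back to bin.

```python
# Well-known file signatures ("magic bytes") for a few common image formats.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def guess_extension(data: bytes) -> str:
    """Return a file extension based on the leading bytes, or 'bin' if unknown."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpg"
    if data.startswith(GIF_SIGNATURES):
        return "gif"
    return "bin"


def build_file_name(image_id: str, data: bytes) -> str:
    """Combine an image id with the guessed extension."""
    return f"{image_id}.{guess_extension(data)}"


print(build_file_name("image_000", PNG_SIGNATURE + b"rest-of-file"))
```

Sniffing the bytes rather than trusting a data-URI header keeps the naming logic working even when the payload arrives without a prefix.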