
When merging PDF archives for long-term digital preservation, what strategies ensure the integrity of embedded timestamps, creation dates, and author attribution across all consolidated documents?

# The Ultimate Authoritative Guide to PDF Merging for Long-Term Digital Preservation: Ensuring Metadata Integrity

## Executive Summary

In the critical domain of long-term digital preservation, the integrity of metadata embedded within PDF documents is paramount. When consolidating multiple PDF archives into a single, cohesive repository using tools like `merge-pdf`, preserving original creation dates, modification timestamps, and author attributions is not merely a best practice; it is a fundamental requirement for authenticity, discoverability, and legal defensibility. This comprehensive guide, authored from the perspective of a Principal Software Engineer, delves into the intricate strategies necessary to safeguard these vital metadata elements during the PDF merging process. We will explore the technical underpinnings of PDF metadata, analyze the behavior of `merge-pdf` and other potential tools, present practical scenarios and their solutions, align with global industry standards, provide a multilingual code repository, and project future trends in this evolving field. Our objective is to equip archivists, IT professionals, and digital preservationists with the knowledge and tools to perform PDF merges that uphold the highest standards of data integrity.

## Deep Technical Analysis: The Anatomy of PDF Metadata and Merging Implications

To effectively merge PDFs while preserving metadata, a profound understanding of the PDF file structure and how metadata is represented is essential.

### 3.1 Understanding PDF Metadata

PDF (Portable Document Format) is a complex file format developed by Adobe. Its structure is based on a cross-referenced object system, where various elements like pages, fonts, images, and metadata are stored as distinct objects.

#### 3.1.1 The Info Dictionary

The primary location for document-level metadata in a PDF is the **Info dictionary**.
This dictionary is an associative array containing key-value pairs that describe the document's properties. Key fields relevant to our discussion include:

* **`/Title`**: The title of the document.
* **`/Author`**: The author of the document.
* **`/Subject`**: The subject of the document.
* **`/Keywords`**: Keywords associated with the document.
* **`/Creator`**: The application that created the original document.
* **`/Producer`**: The application that converted the document to PDF.
* **`/CreationDate`**: The date and time the document was created.
* **`/ModDate`**: The date and time the document was last modified.

These values are typically stored as PDF string objects. The format for date and time values is specified by the PDF standard and follows the pattern `D:YYYYMMDDHHmmSSOHH'mm'`, where `O` is `+`, `-`, or `Z` and expresses the offset of local time from UTC (for example, `D:20200115090000+00'00'`).

#### 3.1.2 XMP Metadata (Extensible Metadata Platform)

Beyond the Info dictionary, modern PDFs often contain **XMP metadata**. XMP is an Adobe-developed standard that allows for richer and more structured metadata. XMP data is embedded within the PDF as an XML stream, normally a dedicated metadata stream object referenced by the `/Metadata` entry of the document catalog. This XML can contain a vast array of properties, including those found in the Info dictionary, but also much more granular information, such as:

* Dublin Core metadata elements.
* IPTC (International Press Telecommunications Council) data.
* EXIF (Exchangeable Image File Format) data (for image-based PDFs).
* Rights management information.
* Custom metadata schemas.

XMP metadata is often preferred for its flexibility and extensibility.

#### 3.1.3 Page-Level Metadata

While less common for global document properties, some PDF elements and annotations can have their own associated metadata. However, for the purpose of merging entire documents, our focus remains on the document-level Info dictionary and XMP metadata.
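The date layout used by `/CreationDate` and `/ModDate` (section 3.1.1) can be produced from a Python `datetime` with the standard library alone. The helper below is a minimal sketch, not part of any PDF library; the name `to_pdf_date` is hypothetical, and the trailing apostrophe after the offset minutes follows the style of the date examples used in this guide.

```python
from datetime import datetime, timedelta, timezone


def to_pdf_date(dt):
    """Format a datetime as a PDF date string such as D:20200115090000+00'00'."""
    stamp = dt.strftime("D:%Y%m%d%H%M%S")
    offset = dt.utcoffset()
    if offset is None:
        # Naive datetime: the zone is unknown, so no offset is written.
        return stamp
    total_minutes = int(offset.total_seconds() // 60)
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{stamp}{sign}{hours:02d}'{minutes:02d}'"


if __name__ == "__main__":
    eastern = timezone(timedelta(hours=-5))
    print(to_pdf_date(datetime(2021, 3, 22, 14, 30, 0, tzinfo=eastern)))
```

Writing the offset explicitly, rather than silently assuming UTC, keeps the provenance of a timestamp unambiguous when records from several time zones end up in one archive.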
### 3.2 The `merge-pdf` Tool: Capabilities and Limitations

`merge-pdf` is a command-line utility designed for combining multiple PDF files into a single output file. Its primary function is to append the pages of the input PDFs sequentially. However, its handling of metadata during this process requires careful consideration.

#### 3.2.1 Default Metadata Handling

By default, many PDF merging tools, including `merge-pdf` (depending on its specific implementation and options), often:

* **Propagate the metadata of the *first* input PDF to the merged document.** This means the `/Title`, `/Author`, `/CreationDate`, etc., of the initial PDF in the merge sequence will likely become the metadata for the entire consolidated document, overwriting or ignoring metadata from subsequent files.
* **Generate a new `/CreationDate` and `/ModDate` for the *merged* document.** This new timestamp will reflect when the merge operation itself occurred, not the original creation or modification dates of the constituent documents.
* **May discard or corrupt XMP metadata.** The merging process might not have the logic to correctly parse, merge, or re-embed XMP streams, leading to data loss.

#### 3.2.2 `merge-pdf` Specifics and Options

The `merge-pdf` tool, as a conceptual entity representing a common PDF merging utility, may offer specific command-line arguments to influence metadata handling. It is crucial to consult the tool's documentation for precise details. Common options that might exist or be analogous in other tools include:

* **`--output-metadata` / `--metadata-source`**: Options to specify which input file's metadata should be used for the merged document.
* **`--preserve-metadata`**: A flag that *attempts* to preserve metadata, though its effectiveness can vary.
* **`--merge-strategy`**: Potentially a parameter to define how conflicting metadata should be resolved.
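Because the default behavior differs from tool to tool, it helps to check what a given merger actually did. The sketch below compares the Info fields of the first input with those of the merged output and reports which of the behaviors listed in 3.2.1 occurred. It works on plain dictionaries (keys without the leading `/`), so how those dictionaries are read from the PDFs is left open; the function name and its parameters are hypothetical.

```python
DESCRIPTIVE_KEYS = ("Title", "Author", "Subject", "Keywords")


def diagnose_merge(first_info, merged_info, source_has_xmp, merged_has_xmp):
    """Return human-readable findings about how a merge treated metadata."""
    findings = []

    # Descriptive fields copied unchanged from the first input.
    inherited = [
        key for key in DESCRIPTIVE_KEYS
        if key in first_info and merged_info.get(key) == first_info[key]
    ]
    if inherited:
        findings.append("inherited from first input: " + ", ".join(inherited))

    # Timestamps that no longer match the first input.
    for key in ("CreationDate", "ModDate"):
        if key in first_info and merged_info.get(key) != first_info[key]:
            findings.append(f"{key} differs from first input (regenerated or dropped)")

    # XMP present in a source but absent from the output.
    if source_has_xmp and not merged_has_xmp:
        findings.append("XMP stream missing from merged output")
    return findings


if __name__ == "__main__":
    first = {"Title": "Report A", "Author": "Alice",
             "CreationDate": "D:20200115090000+00'00'"}
    merged = {"Title": "Report A", "Author": "Alice",
              "CreationDate": "D:20240101120000+00'00'"}
    for line in diagnose_merge(first, merged, True, False):
        print(line)
```

Running a check like this once against a small test set is usually enough to learn a tool's defaults before trusting it with an archive.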
**Crucially, a direct, built-in feature in most simple PDF merging tools to "aggregate" or "preserve all original metadata" from multiple source files into a single merged document is generally *not* a standard offering.** The concept of a single document having multiple distinct original creation dates or authors is inherently contradictory to the structure of a single Info dictionary or XMP stream.

### 3.3 Challenges in Preserving Metadata During Merging

The core challenge lies in the inherent design of PDF metadata. A single PDF document has a single set of `/Title`, `/Author`, `/CreationDate`, etc. When you merge two documents, you are creating a *new* document.

* **Conflicting Metadata:** If Document A was created on Jan 1, 2020, by Author X, and Document B was created on Feb 1, 2021, by Author Y, how do you represent this in a single merged document?
    * Should the merged document be attributed to Author X or Author Y?
    * What should its creation date be? Jan 1, 2020, Feb 1, 2021, or the date of the merge?
* **Loss of Granularity:** Simply copying the metadata of the first document leads to a loss of information about the other documents.
* **XMP Complexity:** XMP metadata can be highly structured and may contain relationships between different data elements. A simple append operation can break these relationships or render the XMP invalid.
* **Timestamp Accuracy:** The `/CreationDate` and `/ModDate` fields are critical for establishing provenance. If these are overwritten with the merge date, the historical record is compromised.

### 3.4 Strategies for Metadata Integrity

Given these challenges, a proactive and multi-faceted strategy is required. This strategy involves:

1. **Understanding the Source Metadata:** Before merging, thoroughly audit the metadata of each PDF to be consolidated.
2. **Defining a Metadata Preservation Policy:** Establish clear rules for how metadata will be handled for the consolidated document.
This policy will dictate which metadata takes precedence or how it will be represented.
3. **Leveraging Advanced Tools or Custom Scripting:** Relying solely on basic merge utilities may be insufficient. Advanced PDF manipulation libraries or custom scripts are often necessary.
4. **External Metadata Archiving:** For critical preservation, consider maintaining an external metadata catalog.
5. **Post-Merge Verification:** Implement rigorous checks to ensure metadata integrity after the merge operation.

### 3.5 `merge-pdf` and Metadata: A Practical Approach

Assuming `merge-pdf` is a tool that primarily appends pages, its default behavior will likely lead to metadata loss or overwriting. To achieve metadata integrity, we must adopt a strategy that goes beyond simple concatenation. This often involves:

* **Pre-processing:** Extracting metadata from each PDF before merging.
* **Post-processing:** Injecting or modifying metadata in the *final* merged PDF based on the extracted information and the defined policy.

This implies that `merge-pdf` might be used for the *page concatenation* part, but other tools or scripts will be needed for the metadata handling.

## Deep Dive into Metadata Preservation Strategies with `merge-pdf`

The core challenge when merging PDFs for long-term preservation using `merge-pdf` is that it is fundamentally a page-level operation: it concatenates the pages of the input documents. The metadata associated with the *original* documents often gets lost or overwritten by the metadata of the *first* document in the sequence, or by a new timestamp reflecting the merge operation itself.

To ensure the integrity of embedded timestamps, creation dates, and author attribution across all consolidated documents, we must employ strategies that go beyond the basic functionality of `merge-pdf`. This requires a sophisticated approach involving pre-processing, intelligent merging, and post-processing.
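A preservation policy of the kind described in 3.4 can be written down as code so that it is applied the same way on every merge. The following sketch chooses which source record supplies the merged document's metadata under three example policies (`"first"`, `"oldest"`, `"newest"`). The policy names, the function names, and the decision to treat a date without an offset as UTC are all assumptions made for illustration; records are plain dictionaries holding the raw `/CreationDate` string.

```python
import re
from datetime import datetime, timedelta, timezone

_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+-])(\d{2})'(\d{2})'?|Z)?$"
)


def parse_pdf_date(value):
    """Parse a full-precision PDF date string; return None if it does not match."""
    match = _PDF_DATE.match(value or "")
    if not match:
        return None
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    offset = timedelta(0)  # assumption: a missing offset is treated as UTC
    if sign:
        offset = timedelta(hours=int(off_h), minutes=int(off_m))
        if sign == "-":
            offset = -offset
    try:
        return datetime(int(year), int(month), int(day), int(hour),
                        int(minute), int(second), tzinfo=timezone(offset))
    except ValueError:
        return None  # e.g. month 13 or an out-of-range offset


def choose_source(records, policy):
    """Pick the record whose metadata the merged document inherits."""
    if policy not in ("first", "oldest", "newest"):
        raise ValueError(f"unknown policy: {policy}")
    if policy == "first":
        return records[0]
    dated = [(parse_pdf_date(r.get("CreationDate")), r) for r in records]
    dated = [(dt, r) for dt, r in dated if dt is not None]
    if not dated:
        return records[0]  # no usable dates: fall back to the first record
    pick = min if policy == "oldest" else max
    return pick(dated, key=lambda pair: pair[0])[1]
```

Comparing parsed, offset-aware values rather than raw strings matters here: two date strings with different offsets do not sort correctly as text.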
### 4.1 Strategy 1: Metadata Extraction and Re-injection (The Gold Standard)

This is the most robust approach for long-term preservation, as it ensures that original metadata is not only preserved but also made accessible and associated with the correct original document.

#### 4.1.1 The Process

1. **Metadata Extraction:**
    * For each PDF file to be merged, extract its Info dictionary metadata and XMP metadata. This requires a PDF parsing library capable of reading these structures.
    * Store this extracted metadata in a structured format (e.g., JSON, XML, CSV), linking it to the original filename.
    * **Crucially, for dates and timestamps, record them precisely as they appear in the PDF.** Do not normalize them at this stage unless explicitly required by your archival standards.
2. **Page Merging with `merge-pdf`:**
    * Use `merge-pdf` (or a similar tool) to perform the actual concatenation of the PDF pages.
    * **Important:** At this stage, the metadata of the merged PDF is largely irrelevant. We are focused on creating a single document containing all the pages. If `merge-pdf` offers an option to *not* overwrite metadata, it might be useful, but we cannot rely on it to preserve *all* original metadata. The primary goal here is page integrity.
3. **Metadata Re-injection (Post-processing):**
    * This is the most complex step. A single merged PDF cannot inherently contain multiple distinct creation dates or authors for its constituent parts. Therefore, the "re-injection" refers to:
        * **Setting the metadata of the *merged* document:** You need to define a policy. Common policies include:
            * **Use the metadata of the *first* document:** This is often the simplest, but results in loss of information about other documents.
            * **Use the metadata of the *most recent* document:** Based on `/CreationDate`.
            * **Use the metadata of the *oldest* document:** Based on `/CreationDate`.
            * **Create a new, generic metadata set:** For example, setting `/Author` to "Archival System" and `/CreationDate` to the date of the merge.
        * **Creating an *external* metadata record:** This is the most comprehensive approach for preservation. The merged PDF is the container, and the external metadata record (e.g., an XML file, a database entry) holds the original metadata for each constituent document. This external record would include:
            * Original filename.
            * Original `/Title`, `/Author`, `/CreationDate`, `/ModDate`, XMP data.
            * Its position within the merged document (e.g., page range).
            * A hash of the original PDF for integrity verification.
4. **Verification:**
    * After re-injection (or creation of external metadata), verify that the desired metadata has been applied to the merged PDF.
    * If using external metadata, ensure the link between the merged PDF and its external metadata record is secure and accessible.
    * Perform checksums on the merged PDF to ensure its content hasn't been corrupted.

#### 4.1.2 Technical Implementation Considerations

* **PDF Parsing Libraries:**
    * **Python:** `PyPDF2` (now maintained as `pypdf`), `pikepdf` (more robust for XMP and object manipulation), `pdfminer.six`.
    * **Java:** Apache PDFBox.
    * **JavaScript (Node.js):** `pdf-lib`.
* **Metadata Format:** JSON is often a good choice for structured metadata due to its readability and widespread support.
* **Timestamp Handling:** Be mindful of timezones and the specific format (`D:YYYYMMDDHHmmSSOHH'mm'`). Libraries will usually help parse and format these.

#### 4.1.3 Example (Conceptual Python Script)

```python
import json
import os
from datetime import datetime, timezone

import PyPDF2  # maintained today as "pypdf"; the calls used here are the same


def pdf_now():
    """Return the current UTC time as a PDF date string."""
    return datetime.now(timezone.utc).strftime("D:%Y%m%d%H%M%S+00'00'")


def extract_and_merge_pdfs(pdf_files, output_pdf, metadata_output_json):
    """
    Extracts metadata from PDFs, merges them, and creates an external metadata record.
    """
    all_metadata = []
    pdf_writer = PyPDF2.PdfWriter()

    for pdf_path in pdf_files:
        try:
            # Passing the path lets PyPDF2 read the file into memory, so the
            # pages stay readable when the writer serializes them later.
            pdf_reader = PyPDF2.PdfReader(pdf_path)

            # --- Metadata Extraction ---
            current_metadata = {
                "original_filename": os.path.basename(pdf_path),
                "page_count": len(pdf_reader.pages),
            }

            # Info dictionary: strip the leading "/" from each key and store
            # every value as a plain string, exactly as it appears.
            info = pdf_reader.metadata
            if info:
                for key, value in info.items():
                    current_metadata[str(key).lstrip("/")] = str(value)

            # XMP metadata is not extracted here. PyPDF2's XMP access is
            # limited; pikepdf is better suited. A sketch with pikepdf:
            # import pikepdf
            # with pikepdf.open(pdf_path) as pk_pdf:
            #     if "/Metadata" in pk_pdf.Root:
            #         xmp = pk_pdf.Root.Metadata.read_bytes()
            #         current_metadata["xmp_data"] = xmp.decode("utf-8", errors="replace")

            all_metadata.append(current_metadata)

            # --- Page Merging ---
            for page in pdf_reader.pages:
                pdf_writer.add_page(page)

        except FileNotFoundError:
            print(f"Error: File not found - {pdf_path}")
        except Exception as e:
            print(f"Error processing {pdf_path}: {e}")

    if not all_metadata:
        print("No PDF files could be merged.")
        return

    # --- Writing the Merged PDF ---
    # Policy: use metadata from the first document as primary for the merged output.
    first_doc_meta = all_metadata[0]
    merged_doc_metadata = {
        # Fall back to the merge time if there is no original creation date.
        "/CreationDate": first_doc_meta.get("CreationDate", pdf_now()),
        # The modification date records the merge operation itself.
        "/ModDate": pdf_now(),
        "/Producer": "Custom Preservation Merger Script",  # Indicate origin
    }
    for field in ("Title", "Author"):
        if field in first_doc_meta:
            merged_doc_metadata["/" + field] = first_doc_meta[field]

    # add_metadata is the current PdfWriter method; addMetadata is the
    # deprecated camelCase spelling.
    pdf_writer.add_metadata(merged_doc_metadata)

    with open(output_pdf, "wb") as out_file:
        pdf_writer.write(out_file)
    print(f"Successfully merged PDFs into: {output_pdf}")

    # --- Saving External Metadata ---
    with open(metadata_output_json, "w", encoding="utf-8") as md_file:
        json.dump(all_metadata, md_file, indent=4, ensure_ascii=False)
    print(f"External metadata saved to: {metadata_output_json}")


# --- Usage Example ---
if __name__ == "__main__":
    # Example inputs, assumed to exist in the working directory:
    #   pdf1.pdf (Created: 2020-01-15, Author: Alice)
    #   pdf2.pdf (Created: 2021-03-22, Author: Bob)
    #   pdf3.pdf (Created: 2022-07-01, Author: Charlie)
    input_pdf_files = ["pdf1.pdf", "pdf2.pdf", "pdf3.pdf"]
    output_merged_pdf = "preserved_archive.pdf"
    output_metadata_json = "preserved_archive_metadata.json"

    if not all(os.path.exists(f) for f in input_pdf_files):
        raise SystemExit("Please ensure all input PDF files exist.")

    extract_and_merge_pdfs(input_pdf_files, output_merged_pdf, output_metadata_json)

    print("\n--- Verification Notes ---")
    print("1. Open 'preserved_archive.pdf' and check its File > Properties (or equivalent).")
    print("   - The Title, Author, and CreationDate will likely reflect 'pdf1.pdf' (or the first file processed).")
    print("   - A ModDate will be present reflecting the merge time.")
    print("2. Open 'preserved_archive_metadata.json'.")
    print("   - This file contains the *original* metadata for each constituent PDF.")
    print("   - It serves as the authoritative record of original timestamps and authors.")
    print("   - This JSON file should be preserved alongside the merged PDF.")
```

#### 4.1.4 Advantages

* **Maximum Fidelity:** Preserves all original metadata in an accessible format.
* **Auditability:** Provides a clear trail of original document properties.
* **Compliance:** Meets rigorous archival requirements.
* **Flexibility:** External metadata can be searched, queried, and presented independently.

#### 4.1.5 Disadvantages

* **Complexity:** Requires scripting and a deeper understanding of PDF structures.
* **External Dependency:** Requires managing and preserving the external metadata file alongside the merged PDF.
* **Tooling:** May need to use more advanced libraries than a simple `merge-pdf` CLI.
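The external record described in 4.1.1 is meant to carry each source document's page range in the merged file and a hash of the original PDF. A minimal sketch of building such a manifest with the standard library follows. The function name and the field names (`first_page`, `last_page`, `sha256`, `info`) are illustrative assumptions rather than an established schema; page numbers are 1-based and inclusive.

```python
import hashlib
import json


def build_manifest(sources):
    """Build per-document records for an external metadata file.

    sources: iterable of (filename, pdf_bytes, page_count, info_dict),
    given in the same order in which the PDFs are merged.
    """
    manifest = []
    next_page = 1
    for filename, data, page_count, info in sources:
        manifest.append({
            "original_filename": filename,
            "sha256": hashlib.sha256(data).hexdigest(),  # fixity of the source file
            "first_page": next_page,
            "last_page": next_page + page_count - 1,
            "info": dict(info),  # Info dictionary values, kept exactly as extracted
        })
        next_page += page_count
    return manifest


if __name__ == "__main__":
    demo = build_manifest([
        ("a.pdf", b"%PDF-demo-a", 3, {"Author": "Alice"}),
        ("b.pdf", b"%PDF-demo-b", 2, {"Author": "Bob"}),
    ])
    print(json.dumps(demo, indent=2))
```

Hashing the source bytes before merging means the original files can later be matched against the manifest even if the merged PDF is regenerated.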
### 4.2 Strategy 2: Metadata Tagging and Annotation (Less Ideal for Preservation)

This strategy involves adding information about the original documents *within* the merged PDF itself, rather than solely relying on external files.

#### 4.2.1 The Process

1. **Metadata Extraction:** Same as Strategy 1. Extract all original metadata.
2. **Page Merging with `merge-pdf`:** Concatenate pages as usual.
3. **Metadata Injection (Post-processing):**
    * **Document-Level:** Set the primary metadata of the merged document based on a defined policy (e.g., first document, latest document).
    * **Annotation/Watermarking:** For each original document's page range within the merged PDF, add:
        * **Invisible Annotations:** Use PDF annotation objects to store metadata strings associated with specific page ranges. This is technically possible but difficult to implement reliably and may not be recognized by all PDF viewers.
        * **Visible Watermarks/Footers:** Add visible text to each page indicating the original filename, creation date, and author. This is intrusive but clearly visible.
        * **Metadata Fields:** Some PDF forms allow for custom metadata fields. If the target system supports it, these could be populated.

#### 4.2.2 Technical Implementation Considerations

* **PDF Annotation Libraries:** Libraries like `PyMuPDF` (Fitz) in Python or Java's PDFBox are excellent for adding annotations and modifying pages.
* **Watermarking Libraries:** Similar libraries can be used to render text on pages.
* **XMP Modification:** If XMP is critical, libraries like `pikepdf` can be used to parse and modify XMP packets. You might append new `rdf:Description` blocks to the XMP, each representing an original document, but this can quickly make the XMP overly complex and potentially unstable.

#### 4.2.3 Advantages

* **Self-Contained:** All information is within the single merged PDF.
* **Easier to Distribute:** No need to manage separate metadata files.
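For the visible footer variant of this strategy, the text to stamp on each page can be derived from the extracted records before any PDF library is involved. The sketch below only produces the per-page strings; rendering them onto pages is left to whichever library is used. The function name and label layout are hypothetical, and the record keys (`original_filename`, `Author`, `CreationDate`, `page_count`) mirror those used for the extracted JSON records in this guide.

```python
def provenance_labels(records):
    """Return one footer string per page of the merged PDF, in page order."""
    labels = []
    for rec in records:
        text = "Source: {name} | Author: {author} | Created: {created}".format(
            name=rec["original_filename"],
            author=rec.get("Author", "Unknown"),
            created=rec.get("CreationDate", "Unknown"),
        )
        # Every page that came from this source document gets the same label.
        labels.extend([text] * rec["page_count"])
    return labels


if __name__ == "__main__":
    demo = [
        {"original_filename": "a.pdf", "Author": "Alice",
         "CreationDate": "D:20200115090000+00'00'", "page_count": 2},
        {"original_filename": "b.pdf", "page_count": 1},
    ]
    for page_number, label in enumerate(provenance_labels(demo), start=1):
        print(page_number, label)
```

Keeping label generation separate from rendering makes it easy to review exactly what will be stamped before any page content is altered.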
#### 4.2.4 Disadvantages

* **Intrusive:** Visible watermarks alter the document content.
* **Less Structured:** Annotations and watermarks are not as easily searchable or queryable as structured metadata.
* **Potential for Loss:** Annotations can sometimes be stripped or lost during other PDF operations.
* **Complexity:** Programmatically adding annotations to specific page ranges requires careful indexing.
* **XMP Fragmentation:** Appending XMP data can lead to very large and potentially invalid XMP structures.

### 4.3 Strategy 3: Metadata Transformation and Aggregation (Compromise)

This strategy involves transforming the metadata into a single, aggregate representation for the merged document, acknowledging the loss of individual document specificity.

#### 4.3.1 The Process

1. **Metadata Extraction:** Extract all metadata.
2. **Metadata Transformation:**
    * **`/Title`**: Could become a concatenation like "Original Title 1 | Original Title 2 | ...".
    * **`/Author`**: Could become a comma-separated list of authors.
    * **`/CreationDate`**: This is the most problematic. You might choose:
        * The earliest creation date among all documents.
        * The latest creation date among all documents.
        * The date of the merge operation.
        * A special timestamp indicating a consolidated archive.
    * **`/ModDate`**: Always the date of the merge operation.
    * **`/Subject` / `/Keywords`**: Could be concatenated or aggregated.
    * **XMP:** This is where it gets very difficult. You might try to merge XMP schemas, but this is highly complex and often leads to invalid XMP. It's generally better to use Strategy 1 for XMP.
3. **Metadata Re-injection:** Inject the transformed metadata into the merged PDF using a tool or script.

#### 4.3.2 Technical Implementation Considerations

* **Custom Scripting:** Requires significant logic to aggregate and transform different metadata fields according to specific rules.
* **PDF Manipulation Libraries:** To inject the transformed metadata.
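The transformation rules in 4.3.1 can be sketched as a single function over the extracted records. This is an illustration under stated assumptions: the function name is hypothetical, titles are joined with `" | "` and authors with `", "` as suggested above, duplicate authors are collapsed, and the `/CreationDate` and `/ModDate` values are passed in by the caller because their choice depends on policy.

```python
def aggregate_info(records, creation_date, mod_date):
    """Collapse per-document Info fields into one aggregate Info dictionary."""
    titles = [r["Title"] for r in records if r.get("Title")]

    authors = []
    for r in records:
        author = r.get("Author")
        if author and author not in authors:  # de-duplicate, keep first-seen order
            authors.append(author)

    merged = {"/CreationDate": creation_date, "/ModDate": mod_date}
    if titles:
        merged["/Title"] = " | ".join(titles)
    if authors:
        merged["/Author"] = ", ".join(authors)
    return merged


if __name__ == "__main__":
    demo = aggregate_info(
        [{"Title": "Report 2010", "Author": "Acme Corp"},
         {"Title": "Report 2011", "Author": "Acme Corp"}],
        creation_date="D:20100115090000+00'00'",
        mod_date="D:20240101120000+00'00'",
    )
    print(demo)
```

The output shows the cost of this strategy directly: once titles and authors are flattened into single strings, there is no reliable way to tell which author belongs to which title.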
#### 4.3.3 Advantages

* **Single Document:** All metadata is within the merged PDF.
* **Simpler than external metadata:** No separate files to manage.

#### 4.3.4 Disadvantages

* **Significant Data Loss:** Individual document metadata is lost or heavily obscured.
* **Ambiguity:** The meaning of the aggregated metadata can be unclear.
* **Not Suitable for True Preservation:** Does not provide an auditable trail of original document properties.

### 4.4 Choosing the Right Strategy

For **long-term digital preservation**, **Strategy 1 (Metadata Extraction and Re-injection with External Archiving)** is overwhelmingly the recommended approach. It ensures that the original integrity of timestamps, creation dates, and author attributions is maintained and auditable. The merged PDF becomes the container, and the external metadata record becomes the authoritative source for the original properties.

Strategies 2 and 3 represent compromises that may be acceptable for less critical use cases, such as creating a consolidated report where the original provenance is less important than having all content together. However, for archival purposes, they are insufficient.

## Practical Scenarios and Solutions

Let's explore common scenarios and how to apply the recommended strategy.

### 5.1 Scenario 1: Consolidating a Series of Reports with Identical Authors but Different Creation Dates

**Problem:** You have 10 annual reports, all authored by "Acme Corp," but each created in a different year (2010-2019). You need to merge them into a single archive.

**Solution (Strategy 1):**

1. **Extraction:** For each report (e.g., `report_2010.pdf`, `report_2011.pdf`, ...), extract:
    * `/Author`: "Acme Corp"
    * `/CreationDate`: e.g., `D:20100115090000+00'00'`, `D:20110115090000+00'00'`, etc.
    * Store these in a JSON file: `{"original_filename": "report_2010.pdf", "Author": "Acme Corp", "CreationDate": "D:20100115090000+00'00'", ...}` for each.
2. **Merging:** Use `merge-pdf` to combine `report_2010.pdf` through `report_2019.pdf` into `acme_reports_archive.pdf`.
3. **Post-processing (Metadata for Merged Doc):**
    * Set the `/Author` of `acme_reports_archive.pdf` to "Acme Corp".
    * Set the `/CreationDate` of `acme_reports_archive.pdf` to the earliest date found (`D:20100115090000+00'00'`).
    * Set the `/ModDate` to the merge date.
4. **External Metadata:** The generated JSON file containing all original metadata for each report is preserved alongside `acme_reports_archive.pdf`. This JSON is the authoritative record of individual report creation dates.

### 5.2 Scenario 2: Merging Project Deliverables from Multiple Contributors

**Problem:** A project involved several sub-teams, each producing PDF deliverables. You need to merge these into a final project archive. Each deliverable has a different author and creation date.

**Solution (Strategy 1):**

1. **Extraction:** Extract metadata for each deliverable PDF (e.g., `deliverable_team_a.pdf`, `deliverable_team_b.pdf`). Record:
    * `/Author`: e.g., "Team A", "Team B"
    * `/CreationDate`: e.g., `D:20230510143000+01'00'`, `D:20230620110000+00'00'`
    * Store in JSON.
2. **Merging:** Merge all deliverable PDFs into `project_final_archive.pdf`.
3. **Post-processing (Metadata for Merged Doc):**
    * Set the `/Author` of `project_final_archive.pdf` to a generic "Project Team" or the lead project manager.
    * Set the `/CreationDate` of `project_final_archive.pdf` to the date of the final project sign-off or the merge date.
    * Set the `/ModDate` to the merge date.
4. **External Metadata:** The JSON file serves as the official record of each team's contribution, their original author attribution, and creation timestamps.

### 5.3 Scenario 3: Archiving Legacy Documents with Varying Metadata Quality

**Problem:** You are archiving a collection of older PDFs where metadata might be missing or inconsistent. Some might lack author information or have unusual date formats.
**Solution (Strategy 1):**

1. **Extraction:**
    * Use robust PDF parsing libraries that can handle malformed data gracefully.
    * For missing fields (e.g., `/Author`), record it as "Unknown" or "N/A" in the extracted metadata.
    * For unusual date formats, attempt to parse them into the standard PDF date format if possible. If not, record the raw string and flag it for manual review.
    * **Crucially, log any errors or warnings encountered during extraction.**
2. **Merging:** Merge the documents into `legacy_archive.pdf`.
3. **Post-processing (Metadata for Merged Doc):**
    * Set `/Author` to "Archival System" or a similar identifier.
    * Set `/CreationDate` to the earliest *valid* creation date found or the merge date if no valid dates exist.
    * Set `/ModDate` to the merge date.
4. **External Metadata:** The JSON file captures the best possible metadata for each original document, including any notes about missing or problematic fields. This external record is vital for understanding the limitations of the original data.

### 5.4 Scenario 4: Preserving Rich XMP Metadata

**Problem:** You are archiving PDFs that contain detailed XMP metadata (e.g., copyright, usage rights, camera settings for scanned images).

**Solution (Strategy 1 - Enhanced):**

1. **Extraction:**
    * Use a library like `pikepdf` that excels at parsing and extracting XMP.
    * Save the entire XMP XML block for each PDF as a string within your JSON metadata record.
2. **Merging:** Merge the PDFs. The simple `merge-pdf` tool might not handle XMP correctly during merging. The page content is the priority.
3. **Post-processing (Metadata for Merged Doc):**
    * The Info dictionary metadata of the merged document can be set as per standard policy (e.g., first document's metadata).
    * **Crucially, the XMP metadata of the merged document itself will be problematic if the merger doesn't support XMP aggregation.** Most simple mergers won't.
It's best to consider the merged document's XMP as a "container XMP" with minimal information (or as generated by the merge process), and rely on the *external* metadata for the original rich XMP.
4. **External Metadata:** The JSON file will contain the full, original XMP XML for each constituent PDF. This is where the rich metadata is truly preserved and accessible.

## Global Industry Standards and Best Practices

Long-term digital preservation is guided by international standards and best practices. Adhering to these ensures the authenticity, trustworthiness, and long-term accessibility of digital assets.

### 6.1 ISO Standards

* **ISO 14721:2012 (OAIS - Open Archival Information System):** This is the foundational model for digital archives. It defines the functional entities, reference model, and core concepts for preserving digital information. Our strategy of extracting metadata and preserving it externally aligns with the OAIS principle of maintaining PDI (Preservation Description Information), which covers information such as the provenance and fixity of the object.
* **ISO 30300 Series (Management systems for records):** These standards provide frameworks for records management and information governance, which are crucial for ensuring the integrity and authenticity of digital records throughout their lifecycle.
* **ISO 15489 (Records Management):** Principles of records management apply to digital records, emphasizing authenticity, reliability, integrity, and usability.

### 6.2 NISO Standards (National Information Standards Organization)

* **NISO RP-10:2014 (Recommended Practice: Identifying the Characteristics of a Digital Forensic Image):** While focused on digital forensics, the principles of creating bit-for-bit exact copies and ensuring data integrity are highly relevant to digital preservation.
* **NISO Z39.85:2001 (The Dublin Core Element Set):** A set of fifteen core metadata elements for resource discovery.
Many PDFs include Dublin Core metadata, and preserving this is important for interoperability.

### 6.3 Trusted Digital Repositories (TDRs)

Organizations like **The Digital Preservation Coalition (DPC)** and **nestor (Network of Expertise in long-term Storage)** provide guidelines for establishing and operating Trusted Digital Repositories. Key principles that guide our metadata strategy include:

* **Authenticity:** Ensuring that the digital object is what it purports to be. Preserving original metadata is key to this.
* **Integrity:** Ensuring that the digital object has not been altered or corrupted. Hashing and meticulous metadata preservation contribute to this.
* **Usability:** Ensuring that the digital object can be accessed and understood. Well-structured metadata is essential for this.
* **Provenance:** Maintaining a clear and auditable record of the digital object's origin, history, and ownership. This is precisely what our external metadata strategy achieves.

### 6.4 Best Practices for Metadata Preservation

* **Declare Metadata Early and Often:** Capture metadata at the point of creation and ensure it's embedded in the file.
* **Use Standard Metadata Schemas:** Employ widely recognized schemas like Dublin Core, XMP, etc.
* **Maintain Separate Metadata Records:** For critical preservation, external metadata records are often more robust than relying solely on embedded metadata.
* **Regularly Audit Metadata:** Periodically check the integrity and completeness of metadata.
* **Migrate Metadata:** As archival systems evolve, ensure metadata can be migrated and remains usable.
* **Document Metadata Practices:** Clearly document the policies and procedures for metadata creation, extraction, and preservation.

Our recommended strategy, focusing on Strategy 1 (Extraction and External Archiving), directly supports these global standards and best practices by prioritizing the preservation of original metadata in an auditable, structured, and accessible format.
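The "Regularly Audit Metadata" practice and the integrity principle above can be supported by a periodic fixity check. The sketch below assumes an external manifest whose entries hold an `original_filename` and a `sha256` hex digest (hypothetical field names), and that the original PDFs are retained in a directory `root`; it reports whether each file is intact, altered, or missing.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_fixity(manifest, root):
    """Return {filename: status}, where status is 'ok', 'changed', or 'missing'."""
    report = {}
    for entry in manifest:
        name = entry["original_filename"]
        path = Path(root) / name
        if not path.is_file():
            report[name] = "missing"
        elif sha256_file(path) != entry["sha256"]:
            report[name] = "changed"
        else:
            report[name] = "ok"
    return report
```

A scheduled job could run `audit_fixity` over the archive and log any entry that is not `"ok"`, giving the audit trail a concrete, repeatable check.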
## Multi-language Code Vault

To facilitate the implementation of metadata preservation strategies, here's a multilingual code repository demonstrating key concepts. The focus is on Python for its widespread use in scripting and data processing, Java for enterprise environments, and a conceptual Node.js example.

### 8.1 Python Example (Enhanced `pikepdf` for XMP and `PyPDF2` for basic merging)

This example uses `pikepdf` to better handle XMP and general PDF object manipulation, alongside `PyPDF2` for page merging.

```python
import json
import os
from datetime import datetime, timedelta, timezone

import pikepdf
import PyPDF2


def now_as_pdf_date():
    """Returns the current UTC time as a PDF date string."""
    return datetime.now(timezone.utc).strftime("D:%Y%m%d%H%M%S+00'00'")


def pdf_date_to_iso(pdf_date_str):
    """
    Converts a PDF date string (D:YYYYMMDDHHmmSS followed by an optional
    Z or [+/-]HH'mm' offset) to ISO 8601 format.
    Returns None if conversion fails.
    """
    if not pdf_date_str or not pdf_date_str.startswith('D:'):
        return None

    date_part = pdf_date_str[2:16]  # YYYYMMDDHHmmSS
    tz_part = pdf_date_str[16:]     # Z or [+/-]HH'mm'

    try:
        # Basic date parsing (int('') raises ValueError for truncated strings)
        year = int(date_part[0:4])
        month = int(date_part[4:6])
        day = int(date_part[6:8])
        hour = int(date_part[8:10])
        minute = int(date_part[10:12])
        second = int(date_part[12:14])
        dt = datetime(year, month, day, hour, minute, second)

        # Timezone parsing
        if tz_part.startswith('Z'):
            dt = dt.replace(tzinfo=timezone.utc)
        elif tz_part:
            sign = -1 if tz_part[0] == '-' else 1
            tz_hours = int(tz_part[1:3])
            # Minutes follow the apostrophe at index 3 and may be omitted
            tz_minutes = int(tz_part[4:6]) if len(tz_part) >= 6 else 0
            offset = timedelta(hours=tz_hours, minutes=tz_minutes)
            dt = dt.replace(tzinfo=timezone(sign * offset))
        # If there is no timezone info, dt stays naive. For preservation,
        # record it as naive and document the assumption.

        return dt.isoformat()
    except (ValueError, IndexError) as e:
        print(f"Warning: Could not parse PDF date '{pdf_date_str}': {e}")
        return None


def extract_pdf_metadata(pdf_path):
    """
    Extracts Info dictionary and XMP metadata from a PDF using pikepdf.
    Returns a dictionary of extracted metadata.
    """
    metadata = {
        "original_filename": os.path.basename(pdf_path),
        "file_size_bytes": os.path.getsize(pdf_path)
    }

    try:
        with pikepdf.open(pdf_path) as pdf:
            # Extract Info Dictionary (keys look like '/Title')
            if pdf.docinfo:
                for key, value in pdf.docinfo.items():
                    clean_key = key.lstrip('/')
                    try:
                        metadata[clean_key] = str(value)
                    except Exception:
                        pass  # Ignore values that cannot be converted

            # Extract XMP Metadata as the raw XML packet
            if '/Metadata' in pdf.Root:
                try:
                    xmp_stream = pdf.Root.Metadata.read_bytes()
                    metadata["xmp_data"] = xmp_stream.decode('utf-8', errors='replace')
                except Exception as e:
                    print(f"Warning: Could not extract XMP stream for {pdf_path}: {e}")

            # Extract Page Count
            metadata["page_count"] = len(pdf.pages)

    except pikepdf.PasswordError:
        metadata["error"] = "File is password protected."
    except pikepdf.PdfError as e:
        metadata["error"] = f"Pikepdf Error: {e}"
    except Exception as e:
        metadata["error"] = f"General Error: {e}"

    return metadata


def merge_pdfs_with_preservation(input_pdf_paths, output_pdf_path, metadata_output_json_path):
    """
    Merges PDFs, extracts original metadata, and saves it externally.
    Sets the merged document's metadata based on the first input file.
    """
    all_original_metadata = []
    page_merger = PyPDF2.PdfWriter()

    # --- 1. Extract Metadata and Add Pages ---
    for pdf_path in input_pdf_paths:
        if not os.path.exists(pdf_path):
            print(f"Skipping non-existent file: {pdf_path}")
            continue

        # Extract original metadata
        original_meta = extract_pdf_metadata(pdf_path)
        if "error" in original_meta:
            print(f"Error extracting metadata from {pdf_path}: {original_meta['error']}")
            # For preservation, it's often better to stop or log extensively.
            # Here, we'll still try to add pages if possible.
        all_original_metadata.append(original_meta)

        # Add pages for merging. Passing the path (not an open file handle)
        # keeps the page data available until the writer is saved.
        try:
            reader = PyPDF2.PdfReader(pdf_path)
            for page in reader.pages:
                page_merger.add_page(page)
        except Exception as e:
            print(f"Error adding pages from {pdf_path}: {e}")
            # Decide how to handle this - potentially skip the file or stop.

    if len(page_merger.pages) == 0:
        print("No pages were added to the merger. Aborting.")
        return

    # --- 2. Set Metadata for the Merged Document (Policy: Use First Document's) ---
    merged_doc_info = {}
    first_doc_meta = all_original_metadata[0]

    # Use Info Dictionary fields from the first document
    for field in ("Title", "Author", "Subject"):
        if field in first_doc_meta:
            merged_doc_info[f"/{field}"] = first_doc_meta[field]

    # Reuse the first document's original CreationDate string verbatim when it
    # parses cleanly; otherwise fall back to the time of the merge.
    original_creation = first_doc_meta.get("CreationDate")
    if pdf_date_to_iso(original_creation):
        merged_doc_info["/CreationDate"] = original_creation
    else:
        merged_doc_info["/CreationDate"] = now_as_pdf_date()

    # Always set a modification date for the merge operation
    merged_doc_info["/ModDate"] = now_as_pdf_date()
    merged_doc_info["/Producer"] = "Preservation Merging Script (Python)"
    merged_doc_info["/Creator"] = "Preservation Merging Script (Python)"  # Optional, indicates script's role

    # --- 3. Write the Merged PDF with its own metadata ---
    try:
        # add_metadata is the PyPDF2 2.x/3.x name for setting Info entries
        page_merger.add_metadata(merged_doc_info)
        with open(output_pdf_path, 'wb') as outfile:
            page_merger.write(outfile)
        print(f"Successfully merged PDFs to: {output_pdf_path}")
    except Exception as e:
        print(f"Error writing merged PDF {output_pdf_path}: {e}")
        return  # Exit if merged PDF cannot be written

    # --- 4. Save External Original Metadata ---
    # Filter out any entries that failed completely (error and no page count)
    valid_metadata_entries = [
        meta for meta in all_original_metadata
        if "error" not in meta or meta.get("page_count", 0) > 0
    ]

    try:
        with open(metadata_output_json_path, 'w', encoding='utf-8') as mdfile:
            json.dump(valid_metadata_entries, mdfile, indent=4, ensure_ascii=False)
        print(f"Original metadata saved to: {metadata_output_json_path}")
    except Exception as e:
        print(f"Error saving external metadata to {metadata_output_json_path}: {e}")


def create_dummy_pdf(filename, title, author, creation_date_str):
    """Creates a one-page blank PDF with Info dictionary and XMP metadata."""
    try:
        with pikepdf.new() as pdf:
            pdf.add_blank_page()

            # Set Info Dictionary metadata
            pdf.docinfo["/Title"] = title
            pdf.docinfo["/Author"] = author
            pdf.docinfo["/CreationDate"] = creation_date_str
            pdf.docinfo["/Producer"] = "Dummy PDF Generator"

            # Add some dummy XMP metadata. update_docinfo=False keeps pikepdf
            # from rewriting the Info dictionary entries set above.
            with pdf.open_metadata(set_pikepdf_as_editor=False, update_docinfo=False) as meta:
                meta["dc:title"] = title
                meta["dc:creator"] = [author]
                meta["pdf:Producer"] = "Dummy PDF Generator"

            pdf.save(filename)
        print(f"Created: {filename}")
    except Exception as e:
        print(f"Error creating dummy PDF {filename}: {e}")


# --- Example Usage ---
if __name__ == "__main__":
    # Generate dummy PDFs with different authors, creation dates and offsets.
    print("Generating dummy PDFs for demonstration...")
    dummy_files = [
        ("doc_a.pdf", "Report A", "Alice", "D:20200115100000+01'00'"),
        ("doc_b.pdf", "Report B", "Bob", "D:20210322113000+00'00'"),
        ("doc_c.pdf", "Report C", "Charlie", "D:20220701140000-05'00'"),
    ]
    for fname, title, author, cdate in dummy_files:
        create_dummy_pdf(fname, title, author, cdate)

    # --- Actual Merging Process ---
    input_pdf_files = ["doc_a.pdf", "doc_b.pdf", "doc_c.pdf"]
    output_merged_pdf = "preserved_archive_python.pdf"
    output_metadata_json = "preserved_archive_python_metadata.json"

    # Ensure input files exist (created above)
    if all(os.path.exists(f) for f in input_pdf_files):
        merge_pdfs_with_preservation(input_pdf_files, output_merged_pdf, output_metadata_json)
    else:
        print("Dummy PDF creation failed. Please check errors and run again.")

    print("\n--- Python Example Notes ---")
    print(f"1. Open '{output_merged_pdf}' and check its File > Properties.")
    print("   - Title, Author, CreationDate will likely reflect 'doc_a.pdf' (the first file).")
    print("   - ModDate will be set to the time of the merge operation.")
    print(f"2. Open '{output_metadata_json}' to see the original, preserved metadata for each document.")
    print("   - This JSON file is the authoritative record.")
```

### 8.2 Java Example (Using Apache PDFBox)

Apache PDFBox is a mature and widely used Java library for working with PDF documents.
```java
// Written against the Apache PDFBox 2.0.x API (PDDocument.load) and Jackson.
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.util.DateConverter;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

public class PdfMetadataPreserver {

    // Helper class to store metadata
    public static class OriginalMetadata {
        public String originalFilename;
        public long fileSize;
        public String title;
        public String author;
        public String creationDate; // Raw PDF date string, e.g. D:20200115100000+01'00'
        public String modDate;      // Raw PDF date string
        public String xmpData;      // Raw XMP XML string
        public int pageCount;
        public String error;        // For error reporting
    }

    /**
     * Extracts Info Dictionary and XMP metadata from a PDF.
     * @param pdfFile The PDF file to process.
     * @return A list containing one OriginalMetadata object.
     */
    public static List<OriginalMetadata> extractMetadata(File pdfFile) {
        List<OriginalMetadata> metadataList = new ArrayList<>();
        OriginalMetadata meta = new OriginalMetadata();
        meta.originalFilename = pdfFile.getName();
        meta.fileSize = pdfFile.length();

        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDDocumentInformation info = document.getDocumentInformation();
            meta.title = info.getTitle();
            meta.author = info.getAuthor();
            // Read the raw date strings so the original timezone offset survives verbatim
            meta.creationDate = info.getCOSObject().getString(COSName.CREATION_DATE);
            meta.modDate = info.getCOSObject().getString(COSName.MOD_DATE);
            meta.pageCount = document.getNumberOfPages();

            // Extract XMP Metadata as the raw XML packet
            PDMetadata pdMetadata = document.getDocumentCatalog().getMetadata();
            if (pdMetadata != null) {
                try {
                    meta.xmpData = new String(pdMetadata.toByteArray(), StandardCharsets.UTF_8);
                } catch (IOException e) {
                    meta.error = "Error extracting XMP: " + e.getMessage();
                    System.err.println("Error extracting XMP for " + pdfFile.getName() + ": " + e.getMessage());
                }
            }
        } catch (InvalidPasswordException e) {
            meta.error = "File is password protected.";
        } catch (IOException e) {
            meta.error = "IO Error: " + e.getMessage();
            System.err.println("IO Error processing " + pdfFile.getName() + ": " + e.getMessage());
        } catch (Exception e) {
            meta.error = "Unexpected error: " + e.getMessage();
            e.printStackTrace();
        }
        metadataList.add(meta);
        return metadataList;
    }

    /**
     * Merges multiple PDFs and preserves original metadata externally.
     * @param inputPdfFiles List of input PDF files.
     * @param outputPdfPath Path for the merged PDF.
     * @param metadataOutputJsonPath Path for the external metadata JSON file.
     * @throws IOException if an I/O error occurs.
     */
    public static void mergeAndPreserveMetadata(List<File> inputPdfFiles, File outputPdfPath,
                                                File metadataOutputJsonPath) throws IOException {
        List<OriginalMetadata> allOriginalMetadata = new ArrayList<>();
        PDFMergerUtility pdfMerger = new PDFMergerUtility();
        pdfMerger.setDestinationFileName(outputPdfPath.getAbsolutePath());
        int sourceCount = 0; // PDFMergerUtility does not expose its source list

        // --- 1. Extract Metadata and Add Pages ---
        for (File pdfFile : inputPdfFiles) {
            if (!pdfFile.exists()) {
                System.err.println("Skipping non-existent file: " + pdfFile.getAbsolutePath());
                continue;
            }

            // Extract original metadata
            List<OriginalMetadata> extracted = extractMetadata(pdfFile);
            if (extracted != null && !extracted.isEmpty()) {
                OriginalMetadata meta = extracted.get(0);
                allOriginalMetadata.add(meta);
                if (meta.error == null) {
                    // Only add pages if metadata extraction was successful
                    pdfMerger.addSource(pdfFile);
                    sourceCount++;
                } else {
                    System.err.println("Skipping pages for " + pdfFile.getName()
                            + " due to metadata extraction error.");
                }
            }
        }

        if (sourceCount == 0) {
            System.err.println("No valid PDF sources to merge. Aborting.");
            return;
        }

        // --- 2. Perform the merge ---
        pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());

        // --- 3. Set Metadata for the Merged Document (Policy: Use First Document's) ---
        // Load the merged file fully into memory so it can be saved back to the same path.
        byte[] mergedBytes = Files.readAllBytes(outputPdfPath.toPath());
        try (PDDocument mergedDocument = PDDocument.load(mergedBytes)) {
            PDDocumentInformation mergedInfo = mergedDocument.getDocumentInformation();
            OriginalMetadata firstDocMeta = allOriginalMetadata.get(0);

            // Use Info Dictionary fields from the first document
            if (firstDocMeta.title != null) mergedInfo.setTitle(firstDocMeta.title);
            if (firstDocMeta.author != null) mergedInfo.setAuthor(firstDocMeta.author);

            // Reuse the original CreationDate string verbatim when PDFBox can parse it;
            // otherwise fall back to the time of the merge. The external JSON remains
            // the authoritative record of the original dates either way.
            if (DateConverter.toCalendar(firstDocMeta.creationDate) != null) {
                mergedInfo.getCOSObject().setString(COSName.CREATION_DATE, firstDocMeta.creationDate);
            } else {
                mergedInfo.setCreationDate(Calendar.getInstance());
            }

            // Always set a modification date for the merge operation
            mergedInfo.setModificationDate(Calendar.getInstance());
            mergedInfo.setProducer("Preservation Merging Script (Java)");
            mergedInfo.setCreator("Preservation Merging Script (Java)");

            // Save the merged document with updated information
            mergedDocument.save(outputPdfPath);
            System.out.println("Successfully merged PDFs to: " + outputPdfPath.getAbsolutePath());
        }

        // --- 4. Save External Original Metadata ---
        // Filter out any entries that failed completely
        List<OriginalMetadata> validMetadataEntries = new ArrayList<>();
        for (OriginalMetadata meta : allOriginalMetadata) {
            if (meta.error == null || meta.pageCount > 0) {
                validMetadataEntries.add(meta);
            }
        }

        ObjectMapper mapper = new ObjectMapper();
        mapper.enable(SerializationFeature.INDENT_OUTPUT);
        mapper.writeValue(metadataOutputJsonPath, validMetadataEntries);
        System.out.println("Original metadata saved to: " + metadataOutputJsonPath.getAbsolutePath());
    }

    public static void main(String[] args) {
        // Example Usage:
        // This example assumes 'doc_a.pdf', 'doc_b.pdf', 'doc_c.pdf' exist in the project root
        // and were created with distinct metadata. Creating them with PDFBox is possible but
        // verbose; see the conceptual helper at the end of this class.
        System.out.println("Please ensure 'doc_a.pdf', 'doc_b.pdf', 'doc_c.pdf' exist in the project root.");
        System.out.println("These dummy PDFs should be created with specific metadata for testing.");

        List<File> inputFiles = new ArrayList<>();
        inputFiles.add(new File("doc_a.pdf"));
        inputFiles.add(new File("doc_b.pdf"));
        inputFiles.add(new File("doc_c.pdf"));

        File outputFile = new File("preserved_archive_java.pdf");
        File metadataFile = new File("preserved_archive_java_metadata.json");

        try {
            mergeAndPreserveMetadata(inputFiles, outputFile, metadataFile);
        } catch (IOException e) {
            e.printStackTrace();
        }

        System.out.println("\n--- Java Example Notes ---");
        System.out.println("1. Open '" + outputFile.getName() + "' and check its File > Properties.");
        System.out.println("   - Title, Author, CreationDate will likely reflect 'doc_a.pdf'.");
        System.out.println("   - ModDate will be set to the time of the merge operation.");
        System.out.println("2. Open '" + metadataFile.getName() + "' to see the original, preserved metadata.");
        System.out.println("   - This JSON file is the authoritative record.");
    }

    // --- Dummy PDF Creation Helper (Conceptual) ---
    // This is a simplified example. A full implementation involves more detailed PDFBox API usage
    // and additional imports (PDPage, PDPageContentStream, PDType1Font).
    /*
    public static void createDummyPdf(String filename, String title, String author,
                                      Calendar creationDate, String content) throws IOException {
        try (PDDocument document = new PDDocument()) {
            document.addPage(new PDPage());
            PDDocumentInformation info = document.getDocumentInformation();
            info.setTitle(title);
            info.setAuthor(author);
            info.setCreationDate(creationDate);     // This takes java.util.Calendar
            info.setModificationDate(creationDate); // For dummy, same as creation
            info.setProducer("Dummy PDF Generator");

            // Add XMP metadata (complex, requires xmpbox library)
            // ...

            // Add content (simple text for demonstration)
            PDPage page = document.getPage(0);
            PDPageContentStream contentStream = new PDPageContentStream(document, page);
            contentStream.beginText();
            contentStream.setFont(PDType1Font.HELVETICA, 12);
            contentStream.newLineAtOffset(100, 700);
            contentStream.showText(content);
            contentStream.endText();
            contentStream.close();

            document.save(filename);
            System.out.println("Created dummy PDF: " + filename);
        }
    }
    */
}
```

### 8.3 Node.js Example (Using `pdf-lib` and `fs`)

`pdf-lib` is a popular JavaScript library for PDF manipulation.

javascript const { PDFDocument, StandardFonts, rgb } = require('pdf-lib'); const fs = require('fs').promises; const path = require('path'); /** * Extracts Info Dictionary and XMP metadata from a PDF. * Note: pdf-lib's direct XMP extraction is limited. This example focuses on Info. * For robust XMP, you might need a more specialized library or a server-side tool. * @param {string} pdfPath Path to the PDF file. * @returns {Promise} A promise that resolves to an object with extracted metadata.
*/ async function extractPdfMetadata(pdfPath) { const metadata = { original_filename: path.basename(pdfPath), file_size_bytes: (await fs.stat(pdfPath)).size }; try { const pdfBytes = await fs.readFile(pdfPath); /* updateMetadata: false stops pdf-lib from overwriting Producer and ModDate when the file is loaded */ const pdfDoc = await PDFDocument.load(pdfBytes, { updateMetadata: false }); /* Extract Info Dictionary: pdf-lib exposes these getters directly on PDFDocument */ metadata.title = pdfDoc.getTitle(); metadata.author = pdfDoc.getAuthor(); metadata.subject = pdfDoc.getSubject(); metadata.keywords = pdfDoc.getKeywords(); metadata.creator = pdfDoc.getCreator(); metadata.producer = pdfDoc.getProducer(); /* getCreationDate() and getModificationDate() return JS Date objects; toISOString() normalizes to UTC, so the original offset is not retained */ metadata.creationDate = pdfDoc.getCreationDate()?.toISOString() || null; metadata.modDate = pdfDoc.getModificationDate()?.toISOString() || null; metadata.page_count = pdfDoc.getPageCount(); /* XMP Metadata: pdf-lib does not have direct robust XMP parsing/writing. This part would require an external tool or library, so this example only records that it is not supported. */ metadata.xmp_data = "[XMP extraction not directly supported by pdf-lib in this example]"; } catch (error) { metadata.error = error.message; console.error(`Error processing ${pdfPath}: ${error.message}`); } return metadata; } /** * Merges PDFs and preserves original metadata externally. * Sets the merged document's metadata based on the first input file. * @param {string[]} inputPdfPaths Array of paths to input PDF files. * @param {string} outputPdfPath Path for the merged PDF. * @param {string} metadataOutputJsonPath Path for the external metadata JSON file. */ async function mergeAndPreserveMetadata(inputPdfPaths, outputPdfPath, metadataOutputJsonPath) { const allOriginalMetadata = []; const mergedDoc = await PDFDocument.create(); const font = await mergedDoc.embedFont(StandardFonts.Helvetica); // --- 1.
Extract Metadata and Add Pages --- for (const pdfPath of inputPdfPaths) { if (!await fs.access(pdfPath).then(() => true).catch(() => false)) { console.warn(`Skipping non-existent file: ${pdfPath}`); continue; } const originalMeta = await extractPdfMetadata(pdfPath); allOriginalMetadata.push(originalMeta); if (!originalMeta.error) { try { const pdfBytes = await fs.readFile(pdfPath); const donorDoc = await PDFDocument.load(pdfBytes); const copiedPages = await mergedDoc.copyPages(donorDoc, donorDoc.getPageIndices()); copiedPages.forEach(page => mergedDoc.addPage(page)); } catch (error) { console.error(`Error adding pages from ${pdfPath}: ${error.message}`); // Decide how to handle this - potentially skip the file or stop. } } else { console.warn(`Skipping pages for ${path.basename(pdfPath)} due to metadata extraction error.`); } } if (mergedDoc.getPageCount() === 0) { console.error("No pages were added to the merger. Aborting."); return; } // --- 2. Set Metadata for the Merged Document (Policy: Use First Document's) --- const merged