Literature Module

As the name suggests, this module deals with scientific papers. It has two main classes: LitSearch, which searches PubMed and arXiv, and Paper, which takes IDs found by LitSearch or obtained from other resources (see apis), retrieves the corresponding papers, and processes them into a machine-readable form. A small dataclass, PaperInfo, holds the processed data. Each is covered in turn below.

A module for searching and processing scientific literature from PubMed and arXiv, with functionality to download papers, extract content, and analyze metadata.

Classes Overview

  • LitSearch: Search for papers across scientific databases
  • Paper: Download and process individual papers, extracting text, figures, and tables
  • PaperInfo: A simple dataclass that holds all the data for a paper, including its text and figures. It is also the container used to push the data to the knowledge base.

LitSearch

The LitSearch class provides methods to search PubMed and arXiv databases.

Usage

from mnemosyne.literature.literature import LitSearch

# Initialize the searcher; the PubMed API key is optional
searcher = LitSearch(pubmed_api_key="your_api_key")

# Search PubMed
pubmed_ids = searcher.search(
    query="AI usage in medicine",
    database="pubmed",
    results="id",     # Return PMIDs
    max_results=1000  # Max number of results to return
)

# Search PubMed and return DOIs
dois = searcher.search(
    query="AI in medicine",
    database="pubmed",
    results="doi"     # Return DOIs instead of PMIDs
)

# Search arXiv
arxiv_ids = searcher.search(
    query="machine learning genomics",
    database="arxiv"
)
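
When several related queries are run, the same paper often shows up more than once. The sketch below merges results and drops duplicates. It assumes that `search` returns an iterable of ID strings, which the examples above suggest but do not state, and that the `results` and `max_results` arguments are accepted for the chosen database. `collect_ids` is a hypothetical helper, not part of the module.

```python
from typing import Iterable, List


def collect_ids(searcher, queries: Iterable[str], database: str = "pubmed",
                results: str = "id", max_results: int = 100) -> List[str]:
    """Run several queries and return unique IDs in first-seen order.

    `searcher` is any object with a LitSearch-style `search` method.
    Assumption: `search` returns an iterable of ID strings (or None).
    """
    seen = set()
    merged: List[str] = []
    for query in queries:
        found = searcher.search(
            query=query,
            database=database,
            results=results,
            max_results=max_results,
        )
        for paper_id in found or []:  # tolerate None when nothing is found
            paper_id = str(paper_id)
            if paper_id not in seen:
                seen.add(paper_id)
                merged.append(paper_id)
    return merged
```

With a real searcher this could be called as `collect_ids(LitSearch(), ["AI usage in medicine", "AI in medicine"])`.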

Paper

The Paper class handles downloading and processing individual papers.

Usage

from mnemosyne.literature.literature import Paper

# Initialize from PubMed ID
paper = Paper(
    paper_id="12345678",
    id_type="pubmed",
    citations=True,      # Get citation data
    references=True,     # Get reference data
    related_works=True   # Get related papers
)

# Initialize from arXiv ID
paper = Paper(
    paper_id="2101.12345",
    id_type="arxiv"
)

# Initialize from local PDF file
paper = Paper(
    paper_id=None,
    filepath="/path/to/paper.pdf"
)

# Get paper abstract
abstract = paper.get_abstract()

# Download PDF
pdf_path = paper.download(destination="/downloads/")

# Process paper content
paper.process()  # Extracts text, figures, tables

# Access processed content
print(paper.text)              # Full text
print(paper.figures)           # Extracted figures
print(paper.tables)            # Extracted tables
print(paper.paper_info)        # Metadata from OpenAlex
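
Because `Paper` needs an `id_type` alongside the ID, a mixed list of identifiers has to be classified first. The sketch below guesses the type from the shape of the string: PMIDs are plain integers, and arXiv IDs follow either the current `YYMM.NNNNN` scheme or the older `archive/YYMMNNN` scheme. `guess_id_type` is a hypothetical helper; only the return values `"pubmed"` and `"arxiv"` come from the examples above.

```python
import re
from typing import Optional

# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
_ARXIV_NEW = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style arXiv IDs: archive[.SC]/YYMMNNN, for example hep-th/9901001.
_ARXIV_OLD = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")
# PMIDs are plain positive integers.
_PMID = re.compile(r"^[1-9]\d*$")


def guess_id_type(paper_id: str) -> Optional[str]:
    """Return "arxiv", "pubmed", or None if the ID matches neither shape."""
    cleaned = paper_id.strip()
    if _ARXIV_NEW.match(cleaned) or _ARXIV_OLD.match(cleaned):
        return "arxiv"
    if _PMID.match(cleaned):
        return "pubmed"
    return None
```

The result can be passed as `id_type` when constructing a `Paper`. Whether `Paper` accepts old-style arXiv IDs or version suffixes is not documented here, so those cases should be checked before relying on them.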

Key Features

Literature Search

  • Search PubMed and arXiv databases
  • Return paper IDs or DOIs
  • Configurable result limits

Paper Processing

  • Download PDFs from open access sources
  • Extract paper abstract
  • Extract full text content
  • Extract figures and tables
  • Get paper metadata (title, authors, etc.)
  • Get citation data
  • Get reference data
  • Get related works
  • Generate embeddings from chunked full text
  • Generate embeddings from figures and tables
  • Generate figure interpretation text using vision language models
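
To illustrate what "chunked full text" means in the list above, here is a minimal sketch of splitting text into overlapping word windows before embedding. The module's actual chunking strategy is not documented here; the word-based splitting, the `chunk_text` name, and the default sizes are all assumptions made for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps sentences that straddle a boundary visible in both neighboring chunks, at the cost of embedding some words twice.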

Notes

  • Requires an active internet connection for searching and downloading
  • Some features require a paper ID and will not work when only a local PDF is supplied
  • PDF processing requires local storage for downloaded files
  • Citations, references, and related works data comes from OpenAlex
  • Not every paper can be downloaded; availability depends on open access status and on whether a direct PDF link exists
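
Since some downloads will not succeed, a batch job should record failures rather than stop at the first one. The sketch below shows one way to do that. It assumes that a failed download either raises an exception or returns a falsy value; the module's real failure behavior is not documented here. `download_all` and `make_paper` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def download_all(paper_ids: Iterable[str], make_paper: Callable[[str], object],
                 destination: str) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Try to download each paper; return (paths by ID, failures with reasons)."""
    downloaded: Dict[str, str] = {}
    failed: List[Tuple[str, str]] = []
    for paper_id in paper_ids:
        try:
            path = make_paper(paper_id).download(destination=destination)
        except Exception as exc:  # assumption: failures surface as exceptions
            failed.append((paper_id, f"{type(exc).__name__}: {exc}"))
            continue
        if path:
            downloaded[paper_id] = str(path)
        else:
            # Assumption: a falsy return value also means no PDF was available.
            failed.append((paper_id, "no PDF returned"))
    return downloaded, failed
```

With the classes above, `make_paper` could be `lambda pid: Paper(paper_id=pid, id_type="pubmed")`, and the `failed` list can then be reviewed or retried separately.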

© 2025 Mnemosyne. All rights reserved.