Literature Module

As the name suggests, this module deals with scientific papers. It has two main classes: LitSearch, which searches PubMed and arXiv, and Paper, which takes IDs found by LitSearch or obtained from other resources (see apis), retrieves the corresponding papers, and processes them into a machine-readable form. Each class is covered in turn below.

A module for searching and processing scientific literature from PubMed and arXiv, with functionality to download papers, extract content, and analyze metadata.

Classes Overview

LitSearch: Search for papers across scientific databases
Paper: Download and process individual papers, extracting text, figures, and tables
PaperInfo: A simple dataclass that holds all the data for a paper, including its text and figures. It serves as the container for processed content and is also used to push that data to the knowledgebase.
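
To make the idea of a container dataclass concrete, here is a minimal sketch. It is not the actual PaperInfo definition: the class name PaperInfoSketch and every field name are assumptions chosen for illustration, and the real dataclass may carry different or additional fields.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class PaperInfoSketch:
    """Hypothetical stand-in for PaperInfo; all field names are assumptions."""

    title: Optional[str] = None
    abstract: Optional[str] = None
    text: Optional[str] = None
    figures: list = field(default_factory=list)  # one entry per extracted figure
    tables: list = field(default_factory=list)   # one entry per extracted table


info = PaperInfoSketch(title="Example title", text="Full text goes here.")

# asdict() turns the dataclass into a plain dict, a convenient shape for
# serializing the record before pushing it to a store.
record = asdict(info)
```

A dataclass keeps the record's shape explicit, and the default_factory fields give each instance its own empty list rather than a shared one.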

LitSearch

The LitSearch class provides methods to search PubMed and arXiv databases.

Usage

from mnemosyne.literature.literature import LitSearch

# Initialize the searcher; the PubMed API key is optional
searcher = LitSearch(pubmed_api_key="your_api_key")

# Search PubMed
pubmed_ids = searcher.search(
    query="AI usage in medicine",
    database="pubmed",
    results="id",     # Return PMIDs
    max_results=1000  # Max number of results to return
)

# Search PubMed and return DOIs
dois = searcher.search(
    query="AI in medicine",
    database="pubmed",
    results="doi"     # Return DOIs instead of PMIDs
)

# Search arXiv
arxiv_ids = searcher.search(
    query="machine learning genomics",
    database="arxiv"
)
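
For readers curious about what a PubMed search involves at the HTTP level, the sketch below builds a request URL for the public NCBI E-utilities ESearch endpoint and reads PMIDs out of its JSON response. This illustrates the public API only; whether LitSearch is implemented this way is an assumption, and the helper names build_esearch_url and parse_pmids are hypothetical.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, max_results=20, api_key=None):
    """Build an ESearch URL for PubMed (hypothetical helper, not part of LitSearch)."""
    params = {
        "db": "pubmed",
        "term": query,
        "retmax": max_results,
        "retmode": "json",
    }
    if api_key:
        # The key is optional; NCBI allows a higher request rate when one is supplied.
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(payload):
    """Extract the list of PMIDs from a decoded ESearch JSON response."""
    return payload.get("esearchresult", {}).get("idlist", [])


url = build_esearch_url("AI usage in medicine", max_results=1000)
```

The URL can be fetched with any HTTP client, and the decoded JSON body passed to parse_pmids.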

Paper

The Paper class handles downloading and processing individual papers.

Usage

from mnemosyne.literature.literature import Paper

# Initialize from PubMed ID
paper = Paper(
    paper_id="12345678",
    id_type="pubmed",
    citations=True,      # Get citation data
    references=True,     # Get reference data
    related_works=True   # Get related papers
)

# Initialize from arXiv ID 
paper = Paper(
    paper_id="2101.12345",
    id_type="arxiv"
)

# Initialize from local PDF file
paper = Paper(
    paper_id=None,
    filepath="/path/to/paper.pdf"
)

# Get paper abstract
abstract = paper.get_abstract()

# Download PDF
pdf_path = paper.download(destination="/downloads/")

# Process paper content
paper.process()  # Extracts text, figures, tables

# Access processed content
print(paper.text)              # Full text
print(paper.figures)           # Extracted figures
print(paper.tables)            # Extracted tables
print(paper.paper_info)        # Metadata from OpenAlex
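
Because Paper needs an id_type alongside the identifier, a small helper that guesses the type from the identifier's shape can be handy when IDs come from mixed sources. The sketch below is a hypothetical helper, not part of the module. It assumes PMIDs are plain digit strings of up to eight digits, and it recognizes only new-style arXiv IDs such as 2101.12345, not the older archive/number form.

```python
import re

# New-style arXiv identifier: YYMM.NNNN or YYMM.NNNNN, optionally with a version.
ARXIV_NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Assumption: a PMID is a plain run of 1 to 8 digits.
PMID = re.compile(r"^\d{1,8}$")


def guess_id_type(paper_id):
    """Return "arxiv", "pubmed", or None when the shape is not recognized."""
    paper_id = paper_id.strip()
    if ARXIV_NEW_STYLE.match(paper_id):
        return "arxiv"
    if PMID.match(paper_id):
        return "pubmed"
    return None
```

The returned string matches the id_type values used in the examples above, so it can be passed straight to Paper once a None result has been handled.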

Key Features

Paper Search

Search PubMed and arXiv databases
Return paper IDs or DOIs
Configurable result limits

Paper Processing

Download PDFs from open access sources
Extract paper abstract
Extract full text content
Extract figures and tables
Get paper metadata (title, authors, etc.)
Get citation data
Get reference data
Get related works
Generate embeddings from chunked full text
Generate embeddings from figures and tables
Generate figure interpretation text using vision language models
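
Embedding a paper's full text usually starts by splitting it into chunks. The sketch below shows one common approach, fixed-size word windows with overlap. The chunking strategy the module actually uses is not stated here, so the function name chunk_text, the default sizes, and the word-based splitting are all assumptions.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word windows of chunk_size words, sharing overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk consists only of already-covered words.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
chunks = chunk_text(sample, chunk_size=4, overlap=1)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of embedding some words twice.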

Notes

Requires an active internet connection for searching and downloading
Some features require a paper ID and will not work when only a local PDF is provided
PDF processing requires local storage for downloaded files
Citations, references, and related works data comes from OpenAlex
Not every paper can be downloaded; availability depends on open access status and on whether a direct PDF link exists
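
The open access caveat can be illustrated with a record shaped like an OpenAlex Work object. The sketch below looks for a direct PDF link in such a record and returns None when the paper is not open access or no location exposes a PDF URL. The field names (open_access, best_oa_location, locations, pdf_url) follow the public OpenAlex schema. The helper find_pdf_url is hypothetical, and whether Paper applies this logic internally is an assumption.

```python
def find_pdf_url(work):
    """Return a direct PDF URL from an OpenAlex-style work dict, or None."""
    if not (work.get("open_access") or {}).get("is_oa"):
        return None  # not open access, so nothing to download
    # Prefer the best open access location, then fall back to the others.
    candidates = [work.get("best_oa_location")] + list(work.get("locations") or [])
    for location in candidates:
        if location and location.get("pdf_url"):
            return location["pdf_url"]
    return None  # open access, but only landing pages are listed


closed = {"open_access": {"is_oa": False}}
landing_only = {
    "open_access": {"is_oa": True},
    "best_oa_location": {"landing_page_url": "https://example.org/paper", "pdf_url": None},
}
with_pdf = {
    "open_access": {"is_oa": True},
    "best_oa_location": {"pdf_url": None},
    "locations": [{"pdf_url": "https://example.org/paper.pdf"}],
}
```

Here find_pdf_url(closed) and find_pdf_url(landing_only) both return None, which corresponds to the two failure cases named in the note above, while find_pdf_url(with_pdf) returns the link from the fallback location.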