functions.auxiliary module

class functions.auxiliary.Paper(filename, doi)

Bases: object

A class to represent research papers.

build_text(subtext, char_limit)

Appends provided subtext to Paper text content.

Parameters:
  • subtext (list) – Subtext to append to text content.

  • char_limit (int, optional) – Character limit of each text chunk to be appended.

Return type:

None
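
The exact chunking rule used by build_text is not stated here. As a rough illustration only, one way a list of strings could be packed into chunks under a character limit is sketched below. The helper name chunk_subtext, the space separator, and the choice to keep an oversized piece whole are all assumptions, not the module's actual behavior.

```python
def chunk_subtext(subtext, char_limit=None):
    """Greedily pack strings into chunks of at most char_limit characters.

    Hypothetical helper for illustration; not part of functions.auxiliary.
    """
    if char_limit is None:
        # Assumption: no limit means all pieces form a single chunk.
        return [" ".join(subtext)] if subtext else []

    chunks, current = [], ""
    for piece in subtext:
        candidate = piece if not current else current + " " + piece
        if len(candidate) <= char_limit or not current:
            # The piece fits, or it is a single oversized piece kept whole.
            current = candidate
        else:
            # Close the current chunk and start a new one with this piece.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Under these assumptions, a smaller char_limit yields more chunks, which in turn determines how many files write_to_jsonl produces.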

write_to_jsonl(jsonl_path)

Writes the text content to a sequence of JSONL files, one per text chunk, with the text tokenized by sentence into JSONL lines. For example, if the provided path is dir/file.jsonl and the Paper text contains two chunks, the files dir/file_1.jsonl and dir/file_2.jsonl are generated; if the Paper text contains a single chunk, only dir/file.jsonl is generated.

Parameters:

jsonl_path (str) – File path to save the JSONL files to; the filename extension is ignored.

Return type:

None
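
The file-naming rule described for write_to_jsonl can be sketched as a small path helper. This is a minimal sketch of the naming convention only, not the method's implementation; the function name chunk_paths and its num_chunks argument are hypothetical.

```python
import os


def chunk_paths(jsonl_path, num_chunks):
    """Return the output paths implied by the documented naming rule.

    Hypothetical helper for illustration; not part of functions.auxiliary.
    """
    # The extension of the provided path is ignored; ".jsonl" is always used.
    base, _ext = os.path.splitext(jsonl_path)
    if num_chunks == 1:
        # A single chunk keeps the plain base name.
        return [base + ".jsonl"]
    # Multiple chunks get 1-based numeric suffixes.
    return [f"{base}_{i}.jsonl" for i in range(1, num_chunks + 1)]
```

For instance, chunk_paths("dir/file.jsonl", 2) gives dir/file_1.jsonl and dir/file_2.jsonl, matching the example in the method description.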

functions.auxiliary.extract_paper(paper_path, char_limit=None)

Converts paper PDF at specified path into a Paper object.

Parameters:
  • paper_path (str) – File path of paper PDF.

  • char_limit (int, optional) – Character limit of each text chunk in generated Paper object, default is None.

Returns:

Paper object containing text from specified paper PDF, chunked by character limit.

Return type:

auxiliary.Paper

functions.auxiliary.find_doi(raw_paper)

Attempts to find the DOI link in a paper. Relies on the assumption that the DOI appears on the first page of the paper.

Parameters:

raw_paper (Document) – Raw fitz.Document object.

Returns:

URL of the paper DOI found on the first page of the paper, or "DOI NOT FOUND" if no DOI is found.

Return type:

str
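
How find_doi locates the DOI is not documented. A first-page search of this kind could plausibly be done with a regular expression, as in the sketch below. The function name find_doi_in_text, the regex, and the https://doi.org/ URL form are assumptions; the sketch operates on plain text, which with PyMuPDF might be obtained from the first page via raw_paper[0].get_text().

```python
import re

# Assumed pattern for modern DOIs: "10.", a 4-9 digit registrant code,
# a slash, then a suffix that runs until whitespace or a quote/bracket.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def find_doi_in_text(first_page_text):
    """Return a DOI URL found in the text, or "DOI NOT FOUND".

    Hypothetical helper for illustration; not part of functions.auxiliary.
    """
    match = DOI_PATTERN.search(first_page_text)
    if match is None:
        return "DOI NOT FOUND"
    # Drop trailing punctuation that commonly follows a DOI in running text.
    doi = match.group(0).rstrip(".,;)")
    return "https://doi.org/" + doi
```

Because the return value is always a string, callers need to compare against "DOI NOT FOUND" explicitly rather than test for None.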

functions.auxiliary.get_elsevier_paper(doi_code, api_key, char_limit=None)

Converts Elsevier paper with specified DOI code into a Paper object.

Parameters:
  • doi_code (str) – DOI code of paper, not in URL form.

  • api_key (str) – Elsevier API key.

  • char_limit (int, optional) – Character limit of each text chunk in generated Paper object, default is None.

Returns:

Paper object containing text from specified Elsevier paper with given DOI, chunked by character limit.

Return type:

auxiliary.Paper
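
Since find_doi returns a URL while get_elsevier_paper expects a DOI code that is not in URL form, a small conversion step may be needed between the two. The sketch below shows one way to do that; the helper name doi_url_to_code and the list of recognized prefixes are assumptions.

```python
def doi_url_to_code(doi_url):
    """Strip a resolver prefix from a DOI URL, leaving the bare DOI code.

    Hypothetical helper for illustration; not part of functions.auxiliary.
    """
    value = doi_url.strip()
    if value == "DOI NOT FOUND":
        # Sentinel documented for find_doi; there is no code to extract.
        raise ValueError("no DOI available")
    # Assumed set of common prefixes; matched case-insensitively.
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if value.lower().startswith(prefix):
            return value[len(prefix):]
    # Already a bare DOI code.
    return value
```

The resulting string could then be passed as doi_code, together with an Elsevier API key, to get_elsevier_paper.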