Span Categorization Demo: Glass Transition Temperature#

This notebook demonstrates how to train ChemREL to identify and extract a new chemical property, the glass transition temperature ($T_g$), for polymer compounds.

Tip

To run a copy of this notebook yourself, download the corresponding spancat_demo.ipynb file here.

Setup#

In this demo, we will train a new Tok2Vec span categorization model to label polymers and glass transition temperature values in text extracted from research literature.

Before beginning the demo, ensure that ChemREL is properly installed and that your command line’s working directory is the ChemREL Initial Directory you configured when first installing the package.

Data Preparation#

Before labeling any data, we first need to source it from research texts. To this end, we will extract sample data from a paper hosted on Elsevier. Alternatively, you may supply your own data in PDF form and run the chemrel aux extract-paper command instead. This demo uses the following paper as its example data source.

https://doi.org/10.1016/j.nocx.2022.100084

To download hosted papers from Elsevier using ChemREL, you will need an Elsevier API key. If you do not have one already, request a key at the Elsevier Developer Portal.

Once you have obtained a key, replace [API Key] with your personal key, and run the following command to generate a JSONL data file from the chosen paper.

!chemrel aux extract-elsevier-paper 10.1016/j.nocx.2022.100084 [API Key] ./assets/tg_data.jsonl
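Before annotating, it can help to check that the extraction produced usable text. The following is a minimal sketch for previewing a JSONL file, assuming each line is a JSON object that carries its passage under a "text" key (the shape Prodigy's text recipes read). The helper names read_jsonl and preview are illustrative and are not part of ChemREL.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}, line {line_number}: {err}") from err


def preview(path, limit=3, width=80):
    """Return the first `limit` text snippets, truncated to `width` characters."""
    snippets = []
    for record in read_jsonl(path):
        # Assumption: the passage is stored under the "text" key.
        snippets.append(record.get("text", "")[:width])
        if len(snippets) == limit:
            break
    return snippets
```

For example, preview("./assets/tg_data.jsonl") would return the first three snippets of the extracted paper, which is usually enough to spot an empty or garbled extraction.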

Labeling with Prodigy#

Now that we have generated our JSONL file tg_data.jsonl containing the necessary data from our paper, it’s time to label the property/value spans found in the text. For labeling spans, we recommend using Prodigy, an easy-to-use data annotation tool. While using Prodigy is not required, note that ChemREL expects all training data to conform to spaCy’s binary data format. If you use another annotation tool, make sure that all data fed into ChemREL is in this format.
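If your annotations come from a tool other than Prodigy, one way to produce spaCy's binary format is to build a DocBin yourself. The sketch below assumes annotations are available as character offsets with labels, and that the span categorizer reads spans from the key "sc" (an assumption inferred from the SPANS_SC_* column names in the training log later in this demo). The to_docbin helper and the sample record are hypothetical.

```python
import spacy
from spacy.tokens import DocBin


def to_docbin(records, spans_key="sc"):
    """Pack (text, [(start, end, label), ...]) records into a spaCy DocBin."""
    nlp = spacy.blank("en")  # tokenizer only, no trained components needed
    doc_bin = DocBin()
    for text, annotations in records:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations:
            # char_span returns None when offsets do not align with token boundaries
            span = doc.char_span(start, end, label=label)
            if span is not None:
                spans.append(span)
        doc.spans[spans_key] = spans
        doc_bin.add(doc)
    return doc_bin


# Hypothetical record built from the example sentence used later in this demo.
text = "The polymer Ge2Sb2Te5 had transition temperature Tg = 398 K"
polymer_start = text.index("Ge2Sb2Te5")
tg_start = text.index("398 K")
records = [
    (text, [
        (polymer_start, polymer_start + len("Ge2Sb2Te5"), "POLYMER"),
        (tg_start, tg_start + len("398 K"), "TG"),
    ])
]
doc_bin = to_docbin(records)
# doc_bin.to_disk("./scdata/train.spacy")  # uncomment to write the file
```

Spans whose offsets do not line up with token boundaries are silently skipped here; in practice you may prefer to log them so that misaligned annotations can be corrected.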

Prodigy Installation#

After obtaining a Prodigy license, you can install the Prodigy Python package by following the instructions here. We recommend doing so in a virtual environment for ease of management.

Once you have installed Prodigy or another data annotation tool, proceed below. From this point forward, we will assume that the virtual environment in which Prodigy is installed is active, and that the prodigy command is usable in the command line.

Annotating Spans#

We will now annotate polymer compound names and their corresponding glass transition temperatures in the extracted tg_data.jsonl file. We will assign polymer compound names and transition temperature values the labels POLYMER and TG, respectively, and save the annotations to a new Prodigy dataset tg. To do so, run the following command.

Note: The command can be further customized as appropriate according to the Prodigy spans recipe documentation.

!python -m prodigy spans.manual tg blank:en assets/tg_data.jsonl --label POLYMER,TG

Using 2 label(s): POLYMER, TG
Added dataset tg to database SQLite.

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

^C

✔ Saved 185 annotations to database SQLite
Dataset: tg
Session ID: 2024-04-15_08-36-39

Now, open the web server URL printed above, and begin highlighting the polymer and glass transition temperature spans with their corresponding labels. Once all data samples have been labeled, save the annotations with Ctrl-S or Cmd-S as appropriate, and interrupt the kernel to end the annotation session.

For a more detailed reference on the Prodigy annotation process, see the Prodigy span categorization documentation here.
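After an annotation session, a quick label count can reveal an imbalance between POLYMER and TG spans before any training is run. The sketch below assumes the annotations have been exported as JSONL (for instance with Prodigy's db-out command) and that each record has an "answer" field and a "spans" list whose entries carry a "label". The sample records are hypothetical.

```python
import json
from collections import Counter


def count_accepted_labels(lines):
    """Count span labels across accepted annotation records (one JSON object per line)."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        for span in record.get("spans", []):
            counts[span["label"]] += 1
    return counts


# Two hypothetical records, shaped like exported span annotations.
sample_lines = [
    '{"text": "PS has Tg = 373 K", "answer": "accept", "spans": ['
    '{"start": 0, "end": 2, "label": "POLYMER"}, '
    '{"start": 12, "end": 17, "label": "TG"}]}',
    '{"text": "Unrelated sentence.", "answer": "ignore", "spans": []}',
]
print(count_accepted_labels(sample_lines))
```

With a real export, the same function can be given an open file handle, since iterating over a file yields one line at a time.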

Next, we will run Prodigy’s data-to-spacy command to convert the annotations into spaCy’s binary format. This generates a training file, train.spacy, and a development file, dev.spacy, and saves them to ChemREL’s scdata directory. For this example, we have opted to use the tok2vec model and have thus selected the available sc_tok2vec.cfg config file.

Note: To define a custom evaluation split or add other constraints, see the data-to-spacy command reference.

!python -m prodigy data-to-spacy ./scdata --spancat tg --config ./configs/sc_tok2vec.cfg

ℹ Using language 'en'

============================== Generating data ==============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 148 | Evaluation: 37 (20% split)
Training: 80 | Evaluation: 22
Labels: spancat (2)
✔ Saved 80 training examples
scdata/train.spacy
✔ Saved 22 evaluation examples
scdata/dev.spacy

============================= Generating config =============================
✔ Generated training config

======================== Generating cached label data ========================
✔ Saving label data for component 'spancat'
scdata/labels/spancat.json

============================= Finalizing export =============================
✔ Saved training config
scdata/config.cfg

To use this data for training with spaCy, you can run:
python -m spacy train scdata/config.cfg --paths.train scdata/train.spacy --paths.dev scdata/dev.spacy
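The "20% split" line in the output above reports that 37 of the 185 saved annotations were held out for evaluation, leaving 148 for training. The arithmetic can be sketched with a simple random holdout split. This is an illustration of the idea only, not Prodigy's implementation, and the holdout_split helper is hypothetical.

```python
import random


def holdout_split(examples, eval_fraction=0.2, seed=0):
    """Shuffle examples and hold out a fraction of them for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# 185 placeholder items stand in for the 185 saved annotations.
train, dev = holdout_split(range(185))
print(len(train), len(dev))
```

Fixing the seed keeps the evaluation set stable between runs, which makes scores from different training configurations comparable.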

Training a New Model#

Now that we have generated our binary spaCy data files, it’s time to train a new ChemREL Tok2Vec span categorizer model from our annotations. To do so, we will run ChemREL’s span train-cpu command with the config file we generated, which points to the data files, as follows.

Note: To end training early, interrupt the kernel.

!chemrel span train-cpu --tok2vec-config ./scdata/config.cfg

ℹ Saving to output directory: sctraining
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2024-04-15 08:42:35,253] [INFO] Set up nlp object from config
[2024-04-15 08:42:35,261] [INFO] Pipeline: ['spancat']
[2024-04-15 08:42:35,264] [INFO] Created vocabulary
[2024-04-15 08:42:35,264] [INFO] Finished initializing nlp object
[2024-04-15 08:42:35,362] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.0005
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ----------  ----------  ----------  ------
  0       0        501.80        0.00        0.00        0.00    0.00
 10     200       2531.60        0.00        0.00        0.00    0.00
 23     400        219.50       80.00      100.00       66.67    0.80
 39     600         25.04      100.00      100.00      100.00    1.00
 60     800          5.83       80.00      100.00       66.67    0.80
 84    1000          2.10      100.00      100.00      100.00    1.00
114    1200          2.51      100.00      100.00      100.00    1.00
151    1400          0.53      100.00      100.00      100.00    1.00
196    1600          0.17      100.00      100.00      100.00    1.00
249    1800          0.15      100.00      100.00      100.00    1.00
316    2000          0.13      100.00      100.00      100.00    1.00
394    2200          0.05      100.00      100.00      100.00    1.00
494    2400          0.07      100.00      100.00      100.00    1.00
^C

Aborted!
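The SPANS_SC_P, SPANS_SC_R, and SPANS_SC_F columns in the log above report precision, recall, and F-score as percentages, and in this log the SCORE column matches the F-score rescaled to the 0 to 1 range. As a worked check, the sketch below computes the F-score as the harmonic mean of precision and recall. The span counts used are hypothetical values chosen to reproduce the row with 100.00 precision and 66.67 recall.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 2 correct spans, 0 spurious spans, 1 missed span.
true_pos, false_pos, false_neg = 2, 0, 1
precision = true_pos / (true_pos + false_pos)  # 1.0
recall = true_pos / (true_pos + false_neg)     # about 0.6667
print(round(100 * f_score(precision, recall), 2))  # 80.0
```

With so few evaluation spans, a single missed span moves the F-score from 100 to 80, which is consistent with the jumps between those two values in the log.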

After training is complete, the best-scoring model and the most recently trained model are saved in the model-best and model-last folders, respectively, within the sctraining directory.

Generating Predictions#

Now that we have trained a new model, we can load it to generate predictions on unseen text. To do so, we pass the path to the trained model directory and the text to ChemREL’s predict span command, as follows.

!chemrel predict span ./sctraining/model-best "The polymer Ge2Sb2Te5 had transition temperature Tg = 398 K"

┏━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ # ┃ Span      ┃ Label   ┃ Confidence ┃
┡━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ 1 │ Ge2Sb2Te5 │ POLYMER │ 0.99874985 │
│ 2 │ 398 K     │ TG      │ 0.9999907  │
└───┴───────────┴─────────┴────────────┘

Alternatively, the prediction functionality can be invoked via code by importing the chemrel.functions.predict submodule, as follows.

from chemrel.functions import predict

predict.predict_span("sctraining/model-best", "The polymer Ge2Sb2Te5 had transition temperature Tg = 398 K")

{'POLYMER': [('Ge2Sb2Te5', 0.99874985)], 'TG': [('398 K', 0.9999907)]}
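The returned dictionary maps each label to a list of (span text, confidence) pairs, so it is straightforward to post-process. The sketch below filters predictions by a confidence threshold and parses a temperature string into kelvin. Both helpers are hypothetical and not part of ChemREL, and the parser assumes values are written as a number followed by K or °C.

```python
import re

# Matches strings such as "398 K" or "125 °C" (assumed formats).
TEMPERATURE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*(K|°C)\s*$")


def filter_predictions(predictions, threshold=0.5):
    """Keep only the spans whose confidence is at least `threshold`."""
    return {
        label: [(text, score) for text, score in spans if score >= threshold]
        for label, spans in predictions.items()
    }


def parse_temperature(text):
    """Return the temperature in kelvin, or None if the text is not recognized."""
    match = TEMPERATURE.match(text)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return value + 273.15 if unit == "°C" else value


# The dictionary shown in the output above.
predictions = {"POLYMER": [("Ge2Sb2Te5", 0.99874985)], "TG": [("398 K", 0.9999907)]}
confident = filter_predictions(predictions, threshold=0.9)
tg_values = [parse_temperature(text) for text, _ in confident["TG"]]
print(tg_values)  # [398.0]
```

Returning None for unrecognized strings keeps unexpected formats visible instead of silently converting them, which is useful when extracting values across many papers.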

Nice work! You have successfully trained your first ChemREL extraction model. To view the full CLI documentation for ChemREL, and to learn about ChemREL’s additional functionality such as how to train relation extraction and transfer learning models, see the CLI Reference page here.