Span Categorization Demo: Glass Transition Temperature#
In this notebook, we demonstrate how to train ChemREL to identify and extract a new chemical property, the glass transition temperature ($T_g$), for polymer compounds.
Tip
To run a copy of this notebook yourself, download the corresponding spancat_demo.ipynb file here.
Setup#
In this demo, we will train a new Tok2Vec span categorization model to label polymers and glass transition temperature values in text extracted from research literature.
Before beginning the demo, ensure that ChemREL is properly installed and that your command line's working directory is the ChemREL Initial Directory you configured when first installing the package.
Data Preparation#
Before labeling any data, we first need to source it from research texts. To this end, we will extract sample data from a paper hosted on Elsevier. Alternatively, you may supply your own data in PDF form and run the chemrel aux extract-paper command instead. This demo will use the following text as an example data source.
To download hosted papers from Elsevier using ChemREL, you will need an Elsevier API key. If you do not have one already, request a key at the Elsevier Developer Portal.
Once you have obtained a key, replace [API Key] with your personal key, and run the following command to generate a JSONL data file from the chosen paper.
!chemrel aux extract-elsevier-paper 10.1016/j.nocx.2022.100084 [API Key] ./assets/tg_data.jsonl
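Before annotating, it can help to check that the generated file parses cleanly. The following is a minimal sketch using only the Python standard library. The helper read_jsonl is hypothetical (it is not part of ChemREL), and the "text" key is an assumption based on the input format Prodigy expects, so adjust it if your file uses a different field.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed JSON object per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Path taken from the command above; the block is skipped if the file is absent.
data_path = Path("./assets/tg_data.jsonl")
if data_path.exists():
    records = read_jsonl(data_path)
    print(f"Loaded {len(records)} records")
    for record in records[:3]:
        # Assumption: each record is a JSON object with a "text" field.
        if isinstance(record, dict):
            print(str(record.get("text", ""))[:80])
```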
Labeling with Prodigy#
Now that we have generated our JSONL file tg_data.jsonl containing the necessary data from our paper, it’s time to label the property/value spans found in the text. For labeling spans, we recommend using Prodigy, an easy-to-use data annotation tool. While using Prodigy is not required, note that ChemREL expects all training data to conform to spaCy’s binary data format. If you use another annotation strategy, make sure that all data fed into ChemREL is in this format.
Prodigy Installation#
After obtaining a Prodigy license, you can install the Prodigy PyPI package here. It’s recommended that you do so in a virtual environment for ease of management.
Once you have installed Prodigy or another data annotation tool, proceed below. From this point forward, we will assume that the virtual environment in which Prodigy is installed is active, and that the prodigy command is usable in the command line.
Annotating Spans#
We will now annotate polymer compound names and their corresponding glass transition temperatures in the extracted tg_data.jsonl file. We will assign polymer compound names and transition temperature values the labels POLYMER and TG, respectively, and save the annotations to a new Prodigy dataset named tg. To do so, run the following command.
Note: The command can be further customized as appropriate according to the Prodigy spans recipe documentation.
!python -m prodigy spans.manual tg blank:en assets/tg_data.jsonl --label POLYMER,TG
Using 2 label(s): POLYMER, TG
Added dataset tg to database SQLite.
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!
^C
✔ Saved 185 annotations to database SQLite
Dataset: tg
Session ID: 2024-04-15_08-36-39
Now, open the web server URL printed above and begin highlighting the polymer and glass transition temperature spans with their corresponding labels. Once all data samples have been labeled, save the annotations with Ctrl-S or Cmd-S as appropriate, and interrupt the kernel to end the annotation session.
For a more detailed reference on the Prodigy annotation process, see the Prodigy span categorization documentation here.
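To make the idea of a labeled span concrete, the sketch below builds a simplified annotation record for one sentence. It assumes the character-offset layout Prodigy uses for span tasks (a "text" field plus a "spans" list with "start", "end", and "label"). Records saved by Prodigy carry additional fields such as token indices, which are omitted here, and make_span is a hypothetical helper written for illustration only.

```python
def make_span(text, phrase, label):
    """Build a character-offset span dict for the first occurrence of phrase."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return {"start": start, "end": start + len(phrase), "label": label}


text = "The polymer Ge2Sb2Te5 had transition temperature Tg = 398 K"

# Simplified record: real Prodigy output contains more keys than shown here.
task = {
    "text": text,
    "spans": [
        make_span(text, "Ge2Sb2Te5", "POLYMER"),
        make_span(text, "398 K", "TG"),
    ],
}

for span in task["spans"]:
    # Slicing the text with the offsets recovers the highlighted phrase.
    print(span["label"], repr(text[span["start"]:span["end"]]))
```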
Next, we will run Prodigy’s data-to-spacy command to generate the training and development dataset files in spaCy’s binary format, train.spacy and dev.spacy, respectively, and save them to ChemREL’s scdata directory. For this example, we have opted to use the tok2vec model and have thus selected the available sc_tok2vec.cfg config file.
Note: To define a custom evaluation split or add other constraints, see the data-to-spacy command reference.
!python -m prodigy data-to-spacy ./scdata --spancat tg --config ./configs/sc_tok2vec.cfg
ℹ Using language 'en'
============================== Generating data ==============================
Components: spancat
Merging training and evaluation data for 1 components
- [spancat] Training: 148 | Evaluation: 37 (20% split)
Training: 80 | Evaluation: 22
Labels: spancat (2)
✔ Saved 80 training examples
scdata/train.spacy
✔ Saved 22 evaluation examples
scdata/dev.spacy
============================= Generating config =============================
✔ Generated training config
======================== Generating cached label data ========================
✔ Saving label data for component 'spancat'
scdata/labels/spancat.json
============================= Finalizing export =============================
✔ Saved training config
scdata/config.cfg
To use this data for training with spaCy, you can run:
python -m spacy train scdata/config.cfg --paths.train scdata/train.spacy --paths.dev scdata/dev.spacy
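The log above reports a 20% evaluation split of the 185 saved annotations (148 for training, 37 for evaluation). The sketch below illustrates how a hold-out split of that size works in general. It is not Prodigy's implementation: holdout_split, its shuffling strategy, and the seed are assumptions made for illustration.

```python
import random


def holdout_split(examples, eval_fraction=0.2, seed=0):
    """Shuffle examples and hold back a fraction of them for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# 185 stand-in items, matching the number of annotations saved earlier.
train_examples, dev_examples = holdout_split(range(185))
print(len(train_examples), len(dev_examples))  # 148 37
```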
Training a New Model#
Now that we have generated our binary spaCy data files, it’s time to train a new ChemREL Tok2Vec span categorizer model from our annotations. To do so, we will run ChemREL’s span train-cpu command with the config file we generated, as follows.
Note: To end training prematurely, terminate the kernel.
!chemrel span train-cpu --tok2vec-config ./scdata/config.cfg
ℹ Saving to output directory: sctraining
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2024-04-15 08:42:35,253] [INFO] Set up nlp object from config
[2024-04-15 08:42:35,261] [INFO] Pipeline: ['spancat']
[2024-04-15 08:42:35,264] [INFO] Created vocabulary
[2024-04-15 08:42:35,264] [INFO] Finished initializing nlp object
[2024-04-15 08:42:35,362] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.0005
E # LOSS SPANCAT SPANS_SC_F SPANS_SC_P SPANS_SC_R SCORE
--- ------ ------------ ---------- ---------- ---------- ------
0 0 501.80 0.00 0.00 0.00 0.00
10 200 2531.60 0.00 0.00 0.00 0.00
23 400 219.50 80.00 100.00 66.67 0.80
39 600 25.04 100.00 100.00 100.00 1.00
60 800 5.83 80.00 100.00 66.67 0.80
84 1000 2.10 100.00 100.00 100.00 1.00
114 1200 2.51 100.00 100.00 100.00 1.00
151 1400 0.53 100.00 100.00 100.00 1.00
196 1600 0.17 100.00 100.00 100.00 1.00
249 1800 0.15 100.00 100.00 100.00 1.00
316 2000 0.13 100.00 100.00 100.00 1.00
394 2200 0.05 100.00 100.00 100.00 1.00
494 2400 0.07 100.00 100.00 100.00 1.00
^C
Aborted!
After training is complete, the best and last trained models will be saved in the model-best and model-last folders, respectively, within the sctraining directory.
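The SPANS_SC_P, SPANS_SC_R, and SPANS_SC_F columns in the training log report span-level precision, recall, and F-score as percentages. As a quick check of how they relate, the sketch below recomputes the F-score from the precision and recall shown at step 400 (100.00 and 66.67) using the standard harmonic-mean formula. The function name f1_score is our own and is not part of ChemREL or spaCy.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the step-400 row of the training log above.
print(round(f1_score(100.0, 66.67), 2))  # 80.0
```

Because the F-score is a harmonic mean, a perfect precision cannot compensate for missed spans: recall of about two thirds pulls the score down to 80.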
Generating Predictions#
Now that we have trained a new model, we can load it to generate predictions on unseen text. To do so, we pass the trained model directory to ChemREL’s predict span command, as follows.
!chemrel predict span ./sctraining/model-best "The polymer Ge2Sb2Te5 had transition temperature Tg = 398 K"
┏━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ # ┃ Span ┃ Label ┃ Confidence ┃
┡━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ 1 │ Ge2Sb2Te5 │ POLYMER │ 0.99874985 │
│ 2 │ 398 K │ TG │ 0.9999907 │
└───┴───────────┴─────────┴────────────┘
Alternatively, the prediction functionality can be invoked in code by importing the chemrel.functions.predict submodule, as follows.
from chemrel.functions import predict
predict.predict_span("sctraining/model-best", "The polymer Ge2Sb2Te5 had transition temperature Tg = 398 K")
{'POLYMER': [('Ge2Sb2Te5', 0.99874985)], 'TG': [('398 K', 0.9999907)]}
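The returned dictionary maps each label to a list of (span, confidence) pairs, so it is straightforward to post-process. The sketch below keeps only predictions above a confidence threshold. It works on a literal copy of the output shown above rather than calling ChemREL, and both filter_predictions and the 0.999 threshold are hypothetical choices for illustration.

```python
def filter_predictions(predictions, min_confidence=0.5):
    """Keep only (span, confidence) pairs at or above min_confidence."""
    return {
        label: [(span, conf) for span, conf in spans if conf >= min_confidence]
        for label, spans in predictions.items()
    }


# Literal copy of the output shown above.
predictions = {
    "POLYMER": [("Ge2Sb2Te5", 0.99874985)],
    "TG": [("398 K", 0.9999907)],
}

confident = filter_predictions(predictions, min_confidence=0.999)
print(confident)  # {'POLYMER': [], 'TG': [('398 K', 0.9999907)]}
```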
Nice work! You have successfully trained your first ChemREL extraction model. To view the full CLI documentation for ChemREL, and to learn about ChemREL’s additional functionality such as how to train relation extraction and transfer learning models, see the CLI Reference page here.