Integration of large-scale community-developed causal loop diagrams: a Natural Language Processing approach to merging factors based on semantic similarity
Abstract
Background
Communities have addressed complex public health problems using systems thinking and participatory methods such as Group Model Building (GMB) and Causal Loop Diagrams (CLDs), albeit with some challenges. This study explored the feasibility of using Natural Language Processing (NLP), specifically a range of semantic textual similarity models, to simplify and enhance the merging of CLDs and to avoid merging factors manually.
Methods
The factors of 13 CLDs, developed by different communities in Victoria, Australia, on the health and wellbeing of children and young people, were merged using NLP in four steps: (1) extracting and preprocessing unique factor names; (2) assessing factor similarity using various language models; (3) determining the optimal merging threshold by maximising the F1-score; and (4) merging the factors of the 13 CLDs based on the selected threshold.
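The four steps can be illustrated with a minimal sketch. It is not the study's implementation: token-level Jaccard similarity stands in for the language models, the preprocessing (lowercasing and whitespace normalisation) is assumed, and grouping factors by connected components is one plausible reading of "merging". The function names and the example factor names are hypothetical.

```python
from itertools import combinations


def preprocess(name: str) -> str:
    """Lowercase and collapse whitespace (assumed preprocessing, not the study's)."""
    return " ".join(name.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two preprocessed factor names."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def merge_factors(names, threshold, similarity=jaccard):
    """Group factors whose pairwise similarity is at least `threshold`.

    Assumption: factors linked directly or transitively end up in the same
    group (connected components, implemented with union-find).
    """
    unique = sorted({preprocess(n) for n in names})  # step 1: unique factor names
    parent = {n: n for n in unique}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(unique, 2):             # step 2: score every pair
        if similarity(a, b) >= threshold:            # step 4: merge above threshold
            parent[find(a)] = find(b)

    groups = {}
    for n in unique:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Hypothetical factor names, for illustration only.
example = ["Physical activity", "physical  activity",
           "Level of physical activity", "Screen time", "Sleep quality"]
groups = merge_factors(example, threshold=0.5)
```

Passing a different `similarity` callable, for example one that returns the cosine similarity of sentence embeddings, leaves the rest of the sketch unchanged.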
Results
Overall, sentence-transformer models outperformed word2vec, average word embeddings and Jaccard similarity. Of 161,182 comparisons, the 1,123 that received a score above 0.7 from the sentence-transformer models were assessed by subject matter experts. Paraphrase-multilingual-mpnet-base-v2 achieved the highest F1-score (0.68) and was used to merge the factors at a threshold of 0.75. Of 592 factors, 344 were merged into 66 groups.
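Threshold selection by F1-score can be sketched as follows, assuming each assessed pair has a model similarity score and an expert judgement of whether the two factors should be merged. The scores, labels and candidate thresholds below are made up for illustration and are not the study's data; the function names are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1-score when pairs scoring at or above `threshold` are predicted as 'merge'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, candidates):
    """Return (threshold, f1) with the highest F1; ties go to the earliest candidate."""
    return max(((t, f1_at_threshold(scores, labels, t)) for t in candidates),
               key=lambda pair: pair[1])


# Made-up similarity scores and expert labels (True = experts agree to merge).
scores = [0.95, 0.88, 0.82, 0.74, 0.66, 0.61]
labels = [True, True, False, True, False, False]
threshold, f1 = best_threshold(scores, labels, candidates=[0.6, 0.7, 0.8, 0.9])
```

On this toy data the selected threshold is 0.7: it keeps all three expert-approved pairs while admitting one false positive, and every other candidate gives a lower F1-score.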
Conclusions
Utilizing language models facilitates the identification of similar factors and has the potential to aid researchers in constructing CLDs whilst reducing the time required to merge them manually. While the models accurately merge synonymous or closely related factors, manual intervention may still be required in specific cases.