Multi-label classification and knowledge extraction from oncology-related content on online social networks

Mahdi Hashemi, Margeret Hall

Research output: Contribution to journalArticlepeer-review

10 Scopus citations


This study aims at automatic processing and knowledge extraction from large amounts of oncology-related content from online social networks (OSN). In this context, a large number of OSN textual posts concerning major cancer types are automatically scraped and structured using natural language processing techniques. Machines are trained to assign multiple labels to these posts based on the type of knowledge enclosed, if any. Trained machines are used to automatically classify large-scale textual posts. Statistical inferences are made based on these predictions to extract general concepts and abstract knowledge. Different approaches for constructing document feature vectors showed no tangible effect on the classification accuracy. Among different classifiers, logistic regression achieved the highest overall accuracy (96.4%) and F1 ¯ (73.4) in a 13-way multi-label classification of textual posts. The most common topic was seeking or providing moral support for cancer patients, followed by providing technical information about cancer causes and treatments. The most common causes and treatments of different types of cancer on OSN are also automatically detected in this study. Seeking or providing moral support for cancer patients shared the largest overlap with other topics, i.e. moral support tends to be present even in OSN posts which focus on other topics. On the other hand, providing technical information about cancer diagnosis or prevention were the most isolated topics, where OSN posts tend not to allude to other topics. OSN posts which seek financial support only overlap with the moral support topic, if any. Our methodology and results provide public health professionals with an opportunity to monitor what topics and to which extent are being discussed on OSN, what specific information and knowledge are being disseminated over OSN, and to assess their veracity in close to real time. This helps them to develop policies that encourage, discourage, or modify the consumption of viral oncology-related information on OSN.

Original languageEnglish (US)
Pages (from-to)5957-5994
Number of pages38
JournalArtificial Intelligence Review
Issue number8
StatePublished - Dec 1 2020


  • Cancer
  • Classification
  • Knowledge extraction
  • Machine learning
  • Natural language processing
  • Social networks

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Artificial Intelligence


Dive into the research topics of 'Multi-label classification and knowledge extraction from oncology-related content on online social networks'. Together they form a unique fingerprint.

Cite this