Web scraping for hierarchical clustering

Contact: Daichi Kuroda

Overview

Hierarchical clustering is the task of organizing data into a tree representation. Although the field has been studied intensively, the notion of a hierarchy had not been rigorously defined. Recently, we proposed a rigorous definition of hierarchy. In this project, you will collect text and metadata from Wikipedia and/or arXiv and evaluate how well the hierarchy given by our definition agrees with the hierarchy derived from the metadata.
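
As a rough illustration of the data collection step, the sketch below parses an Atom feed of the kind returned by the arXiv API and arranges papers into a small tree based on their category labels. It is only a sketch under stated assumptions, not the project's actual pipeline: the function names (parse_arxiv_feed, category_path, build_tree) are hypothetical, the first listed category is assumed to be the primary one, and splitting a label such as "cs.LG" into archive and subject is just one simple way to derive a reference hierarchy from the metadata.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds, including those served by the arXiv API.
ATOM = "{http://www.w3.org/2005/Atom}"


def parse_arxiv_feed(xml_text):
    """Extract title, abstract, and category labels from an Atom feed string."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter(ATOM + "entry"):
        # Collapse line breaks and repeated spaces inside the text fields.
        title = " ".join((entry.findtext(ATOM + "title") or "").split())
        abstract = " ".join((entry.findtext(ATOM + "summary") or "").split())
        categories = [c.get("term") for c in entry.findall(ATOM + "category")]
        records.append(
            {"title": title, "abstract": abstract, "categories": categories}
        )
    return records


def category_path(label):
    """Turn a label such as 'cs.LG' into a root-to-leaf path ['cs', 'cs.LG']."""
    archive, _, subject = label.partition(".")
    return [archive, label] if subject else [archive]


def build_tree(records):
    """Nest paper titles under the path of their first-listed category.

    Assumption: the first category in the feed is treated as the primary one.
    """
    tree = {}
    for rec in records:
        if not rec["categories"]:
            continue  # no metadata to place this paper in the tree
        node = tree
        for part in category_path(rec["categories"][0]):
            node = node.setdefault(part, {})
        node.setdefault("_papers", []).append(rec["title"])
    return tree
```

The resulting nested dictionary can serve as the metadata-derived hierarchy against which a clustering of the collected text is compared. A comparable tree for Wikipedia could be built from its category pages, which this sketch does not cover.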

Requirement:

  • web scraping skills