NGI Forward - Topic modelling

Analysis primer

We map social challenges of the tech world using text mining

➀ We identified 6 umbrella topics related to social challenges of internet technologies

We assigned keywords to each topic and used those keywords to retrieve articles.
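As a rough illustration of keyword-based retrieval, the sketch below tags a text with every topic whose keywords it mentions. The topic names and keywords are hypothetical placeholders (the actual lists are not given here), and `match_topics` is an illustrative helper, not the project's code.

```python
import re

# Hypothetical topics and keywords, for illustration only.
TOPIC_KEYWORDS = {
    "privacy": ["data protection", "surveillance"],
    "misinformation": ["fake news", "disinformation"],
}


def match_topics(text, topic_keywords=TOPIC_KEYWORDS):
    """Return the sorted list of topics with at least one keyword in the text."""
    lowered = text.lower()
    matched = []
    for topic, keywords in topic_keywords.items():
        # Case-insensitive, whole-word match of each keyword phrase.
        if any(re.search(r"\b" + re.escape(kw) + r"\b", lowered) for kw in keywords):
            matched.append(topic)
    return sorted(matched)
```

An article is kept for a topic when `match_topics` returns that topic; texts matching no keyword return an empty list.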

➁ We use tech articles shared on Twitter, Reddit and Hacker News

We extract article texts and metadata using the Python package Newspaper3k.
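A minimal sketch of how one article might be extracted with Newspaper3k. The helper names `to_record` and `fetch_article` and the choice of fields are assumptions made for illustration; only `Article`, `download()`, `parse()` and the attributes read below come from the library.

```python
def to_record(article, url):
    """Collect text and metadata from a parsed article into a plain dict."""
    return {
        "url": url,
        "title": article.title,
        "text": article.text,
        "authors": list(article.authors),
        "publish_date": article.publish_date,
    }


def fetch_article(url):
    """Download and parse one article with Newspaper3k (needs network access)."""
    # Imported here so that to_record can be used without the package installed.
    from newspaper import Article  # pip install newspaper3k

    article = Article(url)
    article.download()
    article.parse()
    return to_record(article, url)
```

Calling `fetch_article` on an article URL returns one record; `publish_date` may be `None` when the page does not expose a date.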

➂ Our dataset consists of 111k articles

➃ We cluster the articles based on their similarity

Text data can be treated as high-dimensional vectors. Reducing dimensionality while preserving meaningful clusters is a well-known challenge in text mining. We applied an original combination of algorithms (t-SNE with a single perplexity value of 50, followed by a Gaussian mixture model), which proved effective in producing coherent maps of articles.
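The combination described above can be sketched with scikit-learn. This is an illustration under assumptions, not the project's implementation: the input vectors are synthetic stand-ins for article vectors, and the number of mixture components, the PCA initialisation and the random seed are arbitrary choices. Only the perplexity of 50 is taken from the text.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture


def map_and_cluster(vectors, n_clusters, perplexity=50, seed=0):
    """Project vectors to 2D with t-SNE, then cluster the 2D points with a Gaussian mixture."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    coords = tsne.fit_transform(vectors)
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
    labels = gmm.fit_predict(coords)
    return coords, labels


# Synthetic stand-in for article vectors: three noisy groups of 60 points each.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 20))
vectors = np.vstack([c + rng.normal(size=(60, 20)) for c in centers])

# n_clusters=3 is an assumption that matches the synthetic data only.
coords, labels = map_and_cluster(vectors, n_clusters=3)
```

`coords` holds the 2D map position of each article and `labels` its cluster. t-SNE requires the perplexity to be smaller than the number of samples, which is why the toy data has more than 50 rows.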

Read technical annex

The most frequently occurring words in the articles

Key characteristics

Number of articles: 111,139

Main domains: Medium, The Guardian, NYTimes


Top domains identified in the articles

