Topic Modelling with LDA: A Comprehensive Guide for Students

topic-modelling-lda.ipynb – Colaboratory (2/8/24, 5:59 PM)

    # This Python 3 environment comes with many helpful analytics libraries installed
    # It is defined by the kaggle/python Docker image:
    # For example, here's several helpful packages to load

    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

    # Input data files are available in the read-only "../input/" directory
    # For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

    import os
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

    # You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & R
    # You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

    /kaggle/input/million-headlines/abcnews-date-text.csv

    data = pd.read_csv('../input/million-headlines/abcnews-date-text.csv')
    data

             publish_date  headline_text
    0        20030219      aba decides against community broadcasting lic…
    1        20030219      act fire witnesses must be aware of defamation
    2        20030219      a g calls for infrastructure protection summit
    3        20030219      air nz staff in aust strike for pay rise
    4        20030219      air nz strike to affect australian travellers
    ...      ...           ...
    1226253  20201231      what abc readers learned from 2020 looking bac…
    1226254  20201231      what are the south african and uk variants of …
    1226255  20201231      what victorias coronavirus restrictions mean f…
    1226256  20201231      whats life like as an american doctor during c…
    1226257  20201231      womens shed canberra reskilling unemployed pan…

    1226258 rows × 2 columns

    documents = data['headline_text'].reset_index()[['headline_text', 'index']]

* Splitting the text into sentences and then into words.
* Cleaning any unnecessary non-alphanumeric characters.
* Lowercasing all strings.
* Removing articles, stopwords and other noise.
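The steps listed above can be sketched on a single headline using only the Python standard library. This is a minimal illustration under stated assumptions, not the notebook's own preprocessing code: the names split_sentences, tokenize and prepare, the tiny TOY_STOPWORDS set, and the min_len cutoff are all hypothetical choices made for this example.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# a real pipeline would use a much fuller list.
TOY_STOPWORDS = {"a", "an", "the", "to", "of", "for", "in", "and", "must", "be"}

def split_sentences(text):
    """Naively split text into sentences on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lowercase a sentence and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def prepare(text, stopwords=TOY_STOPWORDS, min_len=3):
    """Apply the listed steps in order and return a flat list of tokens.

    min_len is an assumed cutoff that treats very short tokens as noise.
    """
    tokens = []
    for sentence in split_sentences(text):
        for token in tokenize(sentence):
            if token not in stopwords and len(token) >= min_len:
                tokens.append(token)
    return tokens

headline = "Air NZ strike to affect Australian travellers!"
print(prepare(headline))
# prints ['air', 'strike', 'affect', 'australian', 'travellers']
```

In this sketch "nz" is dropped by the length cutoff and "to" by the stopword set, so both choices directly shape the vocabulary a topic model would later see.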
Data Preparation

    import gensim
    from gensim.utils import simple_preprocess
    from gensim.parsing.preprocessing import STOPWORDS
    from nltk.stem import WordNetLemmatizer, SnowballStemmer
    from nltk.stem.porter import *
    import nltk
    nltk.download('wordnet')

    [nltk_data] Downloading package wordnet to /usr/share/nltk_data…
