Project Introduction:

This project will use techniques our team has learned in our Cloud Computing, Social Media Data Analytics, and EDA classes. We will take a corpus of text from multiple data sources and apply different topic modeling algorithms to it.

Data Analytics Methods:

  • Simple NLP techniques: tokenizing, stemming, etc.
  • Topic modeling: LDA, LSA, NMF, pLSA, HDP, and Doc2Vec, possibly incorporating deep learning techniques from DSCI 471.
  • Seaborn and Matplotlib for data visualization.
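
As a rough illustration of the "simple NLP techniques" item, the sketch below tokenizes a sentence, drops stop words, and applies a deliberately naive suffix-stripping stemmer. The stop-word list, the helper names (tokenize, naive_stem, preprocess), and the sample sentence are all assumptions made for this example; a real pipeline would more likely use an established stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real projects use a fuller one).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip a few common English suffixes (far cruder than a Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


sample = "Markets are rising and investors traded stocks"
tokens = preprocess(sample)
term_counts = Counter(tokens)  # bag-of-words counts for one document
print(tokens)
```

The resulting term counts are the kind of bag-of-words input that the topic modeling algorithms listed above typically start from.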

Cloud Computing Techniques:

  • Use PySpark to quickly analyze data from Vertex AI.
  • Vertex AI and Dataproc for our Jupyter notebooks. We will use these services to build out our models.

Datasets:

Problem Definition:

The problem to be solved is topic modeling of text data and news articles. The topic comes from news and text classification, which is necessary for news sites, social media sites, and any other setting where binning news articles together is important.
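
To make the idea of binning articles by topic concrete, here is a minimal pure-Python sketch of NMF, one of the algorithms listed earlier, using the standard multiplicative update rules. The tiny document-term matrix, the vocabulary, and the choice of two topics are invented for illustration; in the actual notebooks a library implementation would probably be used instead.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W times H."""
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into non-negative W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(k)]
    for _ in range(iterations):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * numer[a][j] / (denom[a][j] + eps)
              for j in range(n_terms)] for a in range(k)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * numer[i][a] / (denom[i][a] + eps)
              for a in range(k)] for i in range(n_docs)]
    return w, h


# Hypothetical vocabulary and term counts: two finance-like and two film-like docs.
vocab = ["stock", "market", "bank", "film", "actor", "award"]
v = [[2, 1, 1, 0, 0, 0],
     [1, 2, 0, 0, 0, 0],
     [0, 0, 0, 2, 1, 1],
     [0, 0, 0, 1, 1, 2]]

# Error of the random starting point (same seed as nmf uses), for comparison.
rng = random.Random(0)
w0 = [[rng.random() + 0.1 for _ in range(2)] for _ in range(len(v))]
h0 = [[rng.random() + 0.1 for _ in range(len(vocab))] for _ in range(2)]
initial_error = frobenius_error(v, w0, h0)

w, h = nmf(v, k=2)
final_error = frobenius_error(v, w, h)

# Each document is binned under the topic with the largest weight in its row of W.
doc_topics = [row.index(max(row)) for row in w]
print(doc_topics, initial_error, final_error)
```

Rows of H weight the vocabulary terms per topic, and rows of W weight the topics per document, so the largest entry in a row of W gives a simple bin assignment for that article.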

Data Source:

US Financial News Dataset 2018

This dataset's text and metadata are all stored in a JSON file, which we will read with the json library. Because this is a text classification problem, most of the metadata can be dropped at the beginning of our preliminary analysis.
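
A minimal sketch of that loading step, using only the json library, might look like the following. The field names ("title", "text", "author", "language"), the sample record, and the helper load_article are assumptions for illustration; the dataset's real schema should be checked before relying on them.

```python
import json
import tempfile
from pathlib import Path


def load_article(path, keep=("title", "text")):
    """Read one JSON article and keep only the fields needed for topic modeling."""
    with open(path, encoding="utf-8") as handle:
        record = json.load(handle)
    # Metadata not listed in `keep` is dropped here.
    return {key: record.get(key, "") for key in keep}


# Hypothetical record standing in for one article of the dataset.
sample = {
    "title": "Stocks rally",
    "text": "Shares rose on Friday.",
    "author": "n/a",
    "language": "english",
}

with tempfile.TemporaryDirectory() as tmp:
    article_path = Path(tmp) / "article.json"
    article_path.write_text(json.dumps(sample), encoding="utf-8")
    article = load_article(article_path)

print(article)
```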

2017 Wikipedia Articles

This dataset is accessed by querying it with SQLite. There are plenty of early Kaggle notebooks to help us write these queries at first.
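
The querying step could be sketched with Python's built-in sqlite3 module as below. The table name articles, its columns, and the rows are hypothetical stand-ins created in an in-memory database; the real file's schema would need to be inspected first.

```python
import sqlite3


def fetch_articles(conn, limit=5):
    """Return up to `limit` (title, text) rows, ordered by title."""
    cursor = conn.execute(
        "SELECT title, text FROM articles ORDER BY title LIMIT ?", (limit,)
    )
    return cursor.fetchall()


# In-memory database with a hypothetical schema, standing in for the real file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("Alan Turing", "Alan Turing was a mathematician."),
        ("Ada Lovelace", "Ada Lovelace wrote about the Analytical Engine."),
        ("Grace Hopper", "Grace Hopper worked on early compilers."),
    ],
)
conn.commit()

rows = fetch_articles(conn, limit=2)
conn.close()
print(rows)
```

With the real dataset, sqlite3.connect would be given the path to the downloaded database file instead of ":memory:".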

Note:

Can you do the above task and write the code in a Jupyter notebook using all the tools and techniques mentioned above?
