COMPUTER SCIENCE SENIOR DESIGN

FALL 2014

Siemens Big Data Analysis

Mario Massad

Matthew Toschi

Tyler Truong

 


 

Project Description

Every day, more unstructured data is published to the internet, where most of it is left untouched. Much of it is raw text, unlabeled and unclassified. Analyzing these enormous volumes of data can reveal discoveries and trends within it. This information, when fully realized, can be used to find unknown relationships between the entities it contains, such as people, businesses, or other groups. Taking these unstructured documents, we use natural language processing and named entity recognition to identify people, organizations, and locations, and to recognize the connections that link them together. We then apply Latent Dirichlet Allocation (LDA) to measure the similarity between documents. Together, the relevance of documents to one another and the entities they share can shed light on previously undiscovered connections.
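
The final step described above, combining shared entities with document similarity, can be illustrated with a minimal sketch. It is not the project's actual code. It assumes the named entity recognition and LDA steps have already run, and it stands in for their output with hypothetical data: the document names, entity sets, and topic distributions below are invented for illustration, as are the function names `cosine` and `link_documents` and the `min_similarity` threshold. Cosine similarity is used here as one common way to compare topic distributions; the project may use a different measure.

```python
from itertools import combinations
from math import sqrt

# Hypothetical output of the two upstream steps (illustrative values only):
#   "entities": names a named entity recognizer might return for the document
#   "topics":   a per-document topic distribution such as LDA might produce
docs = {
    "doc_a": {"entities": {"Alice Smith", "Acme Corp", "Berlin"},
              "topics": [0.7, 0.2, 0.1]},
    "doc_b": {"entities": {"Alice Smith", "Globex", "Berlin"},
              "topics": [0.6, 0.3, 0.1]},
    "doc_c": {"entities": {"Bob Jones", "Initech"},
              "topics": [0.1, 0.1, 0.8]},
}


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_documents(docs, min_similarity=0.5):
    """Return document pairs that share an entity and are topically similar.

    Each result is (first_doc, second_doc, shared_entities, similarity).
    The threshold is an assumed tuning parameter, not a project value.
    """
    links = []
    for name_a, name_b in combinations(sorted(docs), 2):
        shared = docs[name_a]["entities"] & docs[name_b]["entities"]
        similarity = cosine(docs[name_a]["topics"], docs[name_b]["topics"])
        if shared and similarity >= min_similarity:
            links.append((name_a, name_b, sorted(shared), similarity))
    return links


links = link_documents(docs)
for first, second, shared, similarity in links:
    print(f"{first} <-> {second}: shares {shared}, similarity {similarity:.2f}")
```

Requiring both a shared entity and a minimum topic similarity keeps the sketch close to the description: entities supply the candidate connection, and document similarity indicates whether the two documents are about related subject matter.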

Project Files