Ben's Blog A place to share my thoughts

Master’s Thesis

I have finally got round to uploading my master’s thesis! It was by far the most challenging piece of work I have ever done, but it was an incredible and insightful journey that took me a solid 150-200 hours. I have copied the title and abstract below so you can decide whether you wish to embark on the adventure. If you do, I hope you find some value in it and learn a few things about the state of the industry, Big Data and distributed computing! I’ve licensed it under a Creative Commons Attribution 3.0 License, so you’re welcome to share and adapt the work, but you must provide attribution to me, Benjamin Scabbia. Please do send me an email or leave a comment below should you have any questions.

What are some of the common problems encountered by organisations when analysing and managing Big Data, and can these be helped by using Distributed Computing?

Abstract

The Internet of Things is growing rapidly and contributing approximately 10% of the world’s Big Data. This, combined with advances in technology that have seen more sophisticated data-gathering sensors being inserted in a range of equipment, as well as social media, GPS tracking and more, has resulted in more Big Data being created today than ever before. Big Data (characterised by its large volume, variety of formats and the fast speed at which it is gathered) can be difficult for organisations to manage and analyse, and for none more so than organisations restricted by limited budgets. In fact, there is a widely held sentiment that Big Data is only accessible to those operating with a big budget. This is because Big Data is difficult to manage and analyse due to a variety of interconnected key factors: it is costly, it is large (and so storage is problematic), it comes in a variety of formats (and so knowing what software to use to analyse it is difficult) and the tools currently available to service the market (such as Hadoop) are incredibly complex. In addition, data of this scale demands a powerful system to manage and analyse it. Centralised systems are expensive to scale and limited in their capacity to scale up to meet the required need, whereas distributed systems make use of commodity hardware to scale outward indefinitely. For this reason, this project puts forward a middleware application, aimed at small to medium-sized organisations on a limited budget, built entirely using open-source software. It introduces a working prototype of an application capable of performing a basic analysis on a non-local data set via distributed computing. It uses existing technology to provide users with Big Data management and analysis tools in a low-cost manner.

The full thesis is available to download here.

To view the source code, head over to my GitHub.

This work is licensed under a Creative Commons Attribution 3.0 License.