The volume of data created by mankind every day is exploding: at our current pace we create about 2.5 quintillion bytes of data daily, flowing out of the myriad devices we use. The number of Internet users has grown by over a billion in the last five years, and more than half of the world’s web traffic now comes from mobile phones. All of this data holds patterns, insights and secrets waiting to be put to use.
Big data is the term used to describe datasets so large that they can be mined for knowledge, fed into machine learning projects, and used for other advanced analytics. It's an umbrella term for large-scale analysis of all the little bits of data a business or website gathers, and might otherwise discard, in order to draw valuable conclusions from them.
We had access to much of this data before, but storing it was expensive, and extracting information from it took too much time and too many resources. With rapidly increasing computing power and increasingly cheap storage from cloud providers, a new spark has been ignited in the hearts and minds of big data practitioners.
So the current big question is: how do we work with all of this data we now have access to? How do we take advantage of the data boom? With the tools that the big data industry provides us, of course. It is no surprise that Harvard Business Review ranks these among the hottest jobs of the coming years. So how do we get in on this? Let us see.
There are many roles in the big data industry, but broadly speaking they can be classified into two categories: engineering and analytics. The two fields are linked but clearly different.
Big data engineering revolves around planning, deploying and maintaining a system to handle a large amount of data. These systems make relevant data available for various consumer-facing and internal applications.
Big data analytics revolves around making use of the large amounts of data from the systems the engineers build: analyzing patterns and trends and developing classification and prediction algorithms. In short, it involves advanced computation on the data.
So which field should you choose? That depends on your background and your interests; there is no other reason to prefer one over the other. The world of big data infrastructure is changing by the day, with new innovations happening everywhere and a bunch of new technologies like Hadoop, NoSQL, BigQuery, Spark and so on. To keep up with this rapidly changing landscape, we must adapt quickly, while also knowing the basics well enough not to feel the ground shifting under our feet.
The background knowledge needed for big data technologies is the same as that for data science and machine learning. Two complementary skill sets make up this space:
- math and statistics - the main topics to learn are linear algebra, calculus, and probability and statistics. Then study the techniques used in machine learning, such as linear and logistic regression, decision trees, k-nearest neighbors (KNN), random forests and support vector machines. Check out this resource for learning about all these topics. Also available is Coding Blocks' own course on machine learning and Python basics for a good grounding in these topics.
- programming - on the programming side of things, familiarize yourself with a programming language of your choice; R and Python are by far the most popular in this space. Learn the data analysis, visualization and machine learning packages available for your language. For Python, these would be NumPy, SciPy, Pandas, matplotlib, scikit-learn, and so on. For R, they are dplyr, tidyr, readr, tibble, ggplot2, caret, and so on. You also need to be proficient in SQL as a data scientist.
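To give a taste of the techniques listed above, here is a minimal sketch of a k-nearest-neighbors classifier in plain Python; the data points and labels are made up purely for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    # train is a list of (features, label) pairs;
    # find the k points closest to the query and take a majority vote
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# toy dataset: two clusters of 2-D points
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]

print(knn_predict(train, (1.1, 1.0)))  # near the first cluster -> a
print(knn_predict(train, (5.1, 5.0)))  # near the second cluster -> b
```

In practice you would reach for scikit-learn's implementations rather than hand-rolling this, but writing it once makes the idea stick.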
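Since SQL proficiency comes up constantly in analytics work, here is a small self-contained sketch using Python's built-in sqlite3 module; the table and the page-visit data are invented for the example:

```python
import sqlite3

# in-memory database so the example needs no setup
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE visits (page TEXT, device TEXT)")
cur.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("home", "mobile"), ("home", "desktop"), ("home", "mobile"),
     ("home", "mobile"), ("pricing", "mobile")],
)

# a typical analytics query: mobile visits per page
cur.execute("""
    SELECT page, COUNT(*) AS n
    FROM visits
    WHERE device = 'mobile'
    GROUP BY page
    ORDER BY n DESC
""")
rows = cur.fetchall()
print(rows)  # [('home', 3), ('pricing', 1)]
conn.close()
```

The same SELECT/WHERE/GROUP BY pattern carries over directly to warehouse tools like BigQuery mentioned earlier.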
In-demand technologies
After learning all this, you will ultimately want to help build smart systems that let computers do more on their own and free up humans from rote work. So let us look at the technologies most in demand in the big data field:
- Apache Hadoop: Apache Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. Hadoop components like Hive, Pig, HDFS, HBase and MapReduce are in high demand these days.
- Familiarity with cloud tools such as Amazon S3 can also be beneficial.
- Apache Spark: a big data computation framework like Hadoop's MapReduce, but with in-memory processing that makes many workloads much faster. It is becoming one of the most popular big data technologies worldwide.
- NoSQL: NoSQL databases such as Couchbase and MongoDB are increasingly taking over workloads from traditional SQL databases like DB2 and Oracle.
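To make the MapReduce idea behind Hadoop concrete, here is a toy word count written as explicit map, shuffle and reduce phases in plain Python. This is a single-process sketch of what Hadoop distributes across a whole cluster, with made-up input lines:

```python
from collections import defaultdict

def map_phase(lines):
    # emit a (word, 1) pair for every word, like a Hadoop mapper
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # group values by key: the step the framework performs between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # sum the counts for each word, like a Hadoop reducer
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "data beats opinions"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["data"])  # 2 2
```

On a real cluster, mappers and reducers run in parallel on different machines and the shuffle moves data over the network; the programming model, however, is exactly this.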
Knowledge of and expertise with these big data technologies can make you hirable almost anywhere. If that is your goal, remember that there are loads of resources to help when you get stuck, and a passionate community of developers behind each of these technologies. So, happy analyzing!