
How to identify Big Data? A first-hand explanation

Big data is a term that is used very often but has yet to be defined properly. How much data qualifies as big data? Does "big" mean big in size, big in complexity, or both? If hard disks can be combined to store very large amounts of data, why has the whole world become so obsessed with big data? Is it something different from a traditional DBMS?

Many questions come to mind for someone who is starting a journey with big data. I recall my friend Amrit, who in 1999 worked for a company that also sold computers. I asked him for quotes, and he told me about a computer that had a 2GB hard disk. He was so excited that he declared I would not be able to fill that hard disk for years. In his words, “It's a really big hard disk which can hold big data.” Today we can only laugh at that. Today even my car keys have a 64GB flash drive as a companion.


Big is a relative term, and it is quite subjective, although nowadays we start calling data big once it runs to several hundred GB. My first encounter with big data was with Apache logs on a server that hosted several thousand websites. Because of restrictions on the number of open file pointers on the server, I tweaked the Apache configuration to write all logs to a single file, with the virtual host as the first column of each record.

We had to process those logs with the AWStats log analyzer. The server was already under heavy load, so we transferred the log files to another server and ran our processing routines there. After processing, we put the finished AWStats files back on the originating server so users could see their site stats without going to another server. For us, this was big data.
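As a rough illustration of that processing step, here is a minimal Python sketch that splits a combined log back into per-site groups. It assumes only what is described above, namely that the virtual host is the first space-separated column of each record. The function name `split_by_vhost` and the sample lines are made up for this example and are not the routines we actually ran.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_vhost(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group combined log lines by the virtual host in the first column."""
    per_host: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the first space-separated field is the virtual host.
        host, _, rest = line.partition(" ")
        per_host[host].append(rest)
    return dict(per_host)


# Hypothetical sample records, for illustration only.
sample = [
    'example.com 10.0.0.1 - - [01/Jan/2013:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    'example.org 10.0.0.2 - - [01/Jan/2013:00:00:02 +0000] "GET /a HTTP/1.1" 404 128',
    'example.com 10.0.0.3 - - [01/Jan/2013:00:00:03 +0000] "GET /b HTTP/1.1" 200 256',
]
grouped = split_by_vhost(sample)
```

Each group could then be written to its own file and handed to the log analyzer one site at a time. On a really large file you would stream the lines to disk instead of holding them in memory.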

I can identify the following reasons why it was big for us:

1. A single file had to serve multiple loggers, because system restrictions prevented us from writing to multiple files at once.

2. As the server was overloaded with a high number of requests, we could not process our large file on the same server, since that required a lot of memory as well as processor time.

So data can be classified as big data if a single machine cannot create, hold, or process it for the purpose you want to achieve.

If your machine cannot process or hold your data, the first thing that comes to mind is a hardware upgrade. There is a catch in this option: hardware upgrades have limits, and there will come a time when further upgrades are not possible.

So you need multiple systems to act as one. Whenever you find yourself in that situation, you have big data on your hands.


6 Algorithms you must know to be a good programmer

Algorithms are well-defined sets of instructions for solving a problem or achieving a result. Almost every process in the computer world can be described as an algorithm, yet this is the most difficult part of computer science for many learners.

Algorithms are necessary to make your programs intelligent and efficient. They can make your program stand out from the competition. To get a good hold on this subject, you need to be good at analysing problems, understanding them, and drawing abstractions.

In this post I am trying to compile a list of algorithms that every programmer should know. This list is not exhaustive; it only points to some important areas of computer programming.

Sorting

You must be good at different kinds of sorting algorithms, such as quick sort, bubble sort, merge sort, heap sort, and many other useful sorting techniques. You must be able to tell these algorithms apart by their features, their problems, and the scenarios where each is a good fit. Try your hand at different combinations of algorithms; who knows, you may discover a better one while doing so.
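To make one of these concrete, here is a minimal sketch of merge sort in Python. It is only one of the algorithms named above, written for readability and not for speed, and the function name `merge_sort` is just a label chosen for this example.

```python
from typing import List


def merge_sort(items: List[int]) -> List[int]:
    """Return a new sorted list using top-down merge sort."""
    if len(items) <= 1:
        return list(items)

    # Split the input in half and sort each half recursively.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge the two sorted halves into one sorted list.
    merged: List[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= keeps equal items in their original order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


result = merge_sort([5, 2, 9, 1, 5, 6])
```

Merge sort takes time proportional to n log n for n items and keeps equal items in order, whereas bubble sort is easier to write but takes time proportional to n squared in the worst case. Differences like these are what you should be able to reason about.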

Binary Search

This algorithm is necessary to keep up with the demand for efficiency that big data places on programmers. Binary search finds an item in a sorted collection by repeatedly halving the range it looks at, so it needs only about log2(n) comparisons for n items. Many databases rely on the same idea in their indexes to answer queries quickly, and it is useful in many other scenarios.
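A minimal sketch of binary search over a sorted Python list might look like this. The function name and the choice to return -1 for a missing item are assumptions made for the example.

```python
from typing import Sequence


def binary_search(sorted_items: Sequence[int], target: int) -> int:
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1  # target can only be in the upper half
        else:
            hi = mid - 1  # target can only be in the lower half
    return -1


found = binary_search([1, 3, 5, 7, 9, 11], 7)
missing = binary_search([1, 3, 5, 7, 9, 11], 4)
```

Here `found` is 3 and `missing` is -1. The input must already be sorted, otherwise the result is meaningless. In real Python code the standard `bisect` module does the same job.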

Graph Search


In this social network era, graph search is one of the most needed and most used families of algorithms. Many algorithms, such as shortest path, fall under this category. It is usually considered one of the most difficult to understand, as it involves many changing variables. Knowledge of discrete mathematics, graph theory in particular, really helps in understanding this subject. Once you get the idea behind graphs, you can tackle those optimal-solution problems with confidence.
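As a small example of this category, here is a sketch of breadth-first search, which finds a shortest path (fewest hops) in an unweighted graph. The graph is stored as a plain dictionary of neighbour lists, and the names in the sample friendship graph are invented for illustration.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_path(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Return a path with the fewest hops from start to goal, or None."""
    if start == goal:
        return [start]
    previous: Dict[str, Optional[str]] = {start: None}  # also marks visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in previous:
                continue  # already reached by an equally short or shorter route
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back from goal to start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical friendship graph.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}
path = shortest_path(friends, "alice", "erin")
```

With this graph, `path` is `['alice', 'bob', 'dave', 'erin']`. For graphs whose edges carry weights, such as distances, you would reach for something like Dijkstra's algorithm instead.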

Stable Marriage

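Did you know that computer science can help you find a good life partner or friend? This algorithm tries to solve that problem for a scenario in which each person ranks the available people in order of preference and those preferences conflict with each other. It produces a "stable" set of pairs, meaning that no two people would both prefer each other over the partners they were given. Matching algorithms of this kind can be used to suggest good friends and life partners on social networking or dating websites.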
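The classic solution to this problem is the Gale-Shapley algorithm. Below is a minimal sketch of it, assuming two groups of equal size in which everyone ranks every member of the other group. The function name and the sample preferences are made up for this example.

```python
from typing import Dict, List


def stable_matching(
    proposer_prefs: Dict[str, List[str]],
    receiver_prefs: Dict[str, List[str]],
) -> Dict[str, str]:
    """Gale-Shapley: return a stable matching as {proposer: receiver}."""
    # rank[r][p] is how receiver r rates proposer p (lower is better).
    rank = {
        r: {p: i for i, p in enumerate(prefs)}
        for r, prefs in receiver_prefs.items()
    }
    next_choice = {p: 0 for p in proposer_prefs}  # next receiver to propose to
    engaged_to: Dict[str, str] = {}               # receiver -> proposer
    free = list(proposer_prefs)

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = p           # r was free and accepts
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p           # r trades up, current becomes free
            free.append(current)
        else:
            free.append(p)              # r rejects p, who will try the next choice

    return {p: r for r, p in engaged_to.items()}


# Hypothetical preferences, most preferred first.
proposers = {"a": ["x", "y", "z"], "b": ["y", "x", "z"], "c": ["x", "y", "z"]}
receivers = {"x": ["b", "a", "c"], "y": ["a", "b", "c"], "z": ["a", "b", "c"]}
matching = stable_matching(proposers, receivers)
```

For these preferences the result pairs a with x, b with y, and c with z. The side that proposes gets the best stable outcome it can, so which group proposes is a real design choice.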

Map Reduce


MapReduce is a programming model that is essential for developing distributed applications that handle large data, and many algorithms have been adapted to run on it. It is an essential part of the Hadoop system, with which we can run parallel processes over a large amount of data.
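To show the shape of the idea, here is a single-machine sketch of the usual word-count example in plain Python. It does not use Hadoop or any of its APIs; the three functions only imitate the map, shuffle, and reduce steps that a real framework would spread across many machines, and all the names are chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: collect all values that share the same key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
```

Because each document is mapped independently and each key is reduced independently, both steps can run in parallel on different machines, which is what makes the model a good fit for large data.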

Hilltop Algorithm

Many search engine marketers know this algorithm, which was adopted by Google to rank pages based on links from "expert" pages on a topic. This algorithm (or a similar one) is necessary for diving into the large amount of data that our world generates every day. We now face an uphill task in finding useful data or patterns among so much noise, and Hilltop can help with this.

There are many hundreds of algorithms, but you can start with these before you set out on your journey to master this interesting subject.

One major thing to understand about algorithms is their patents. Many algorithms are patented and cannot be used in our programs without a license. So be sure that you are entitled to use an algorithm before you include it in commercial programs that you distribute. Many algorithms have been released under open source licenses.