Milestones with Big data

Nao Kawakami
Nov 11, 2020

This is the fourth week since I started studying big data. I want to write down what I have been through so far so that I can look back on my progress. I started with the basic syntax of Python, and now I am struggling to keep up with regression analysis.

I am excited to tackle regression modeling because I have been curious about how ML and AI work. For example, I hope I will eventually be able to use these tools to recognize biometric information. Face ID and fingerprint ID are surprising technologies. I can use Python not only as an advanced Excel, but also as an AI tool.

Before this, I had no idea what to write in Python, not even something like print(). I kept wondering how we use this tool in the real world. How can print() be helpful in real life? I could not come up with any idea of what to calculate. In reality, there are not many occasions to calculate anything except when shopping. Now I have found that Python is just a calculator. But actually it is not just a calculator; it is an amazing, great, incredibly advanced calculator. Python can be used to build applications, but in my understanding data scientists use it mostly for its calculator function. I do not need to think about UI or UX, because the analysis process is not something that gets shown to an audience.

I am going to summarize what I have learned. This part can serve as my memo, and as material for someone who is curious about this field but is not sure whether it is a good idea to start.

The first step in learning a programming language is always printing Hello World. Then come the basics of basic mathematics: add, subtract, multiply, divide, and then modulo. In daily life modulo does not come up often, but in coding it appears in many scripts. This part was fine, nothing special.
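As a small sketch of this first step, here is Hello World followed by the basic operators. The numbers and variable names are arbitrary and only for illustration.

```python
print("Hello World")

total = 7 + 3        # addition: 10
difference = 7 - 3   # subtraction: 4
product = 7 * 3      # multiplication: 21
quotient = 7 / 3     # division always gives a float: 2.333...
remainder = 7 % 3    # modulo, the remainder after division: 1

# A typical use of modulo in scripts: checking whether a number is even.
is_even = 10 % 2 == 0

print(total, difference, product, quotient, remainder, is_even)
```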

Second, variables, lists, and dictionaries. A variable is assigned a number, text, and so on. If the same value needs to be used several times, a variable is very helpful: you can just use the variable, and you do not have to write long sentences or numbers over and over again. And if you want to change the value, you can just reassign the variable with the new value; you do not need to rewrite it everywhere in the code. A list can contain multiple values. What is the point of using a list? First, your code will be much, much simpler. A list is especially powerful when used with a loop. If you do not use a list, you will have to write the same code 100+ times. A dictionary is a bit similar to a list in that it is powerful when used with a loop. If you want to attach a label (a key) to each value in a list, that is the time to use a dictionary. I prefer using lists, and I spend more time learning their details than dictionaries, because a list is simpler and easier to understand.
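A minimal sketch of these three ideas might look like the following. The names (unit_price, quantities, heights_cm) and the values are made up for illustration.

```python
# A variable: change the value here once and every use below follows.
unit_price = 120

# A list plus a loop: one line of logic handles every quantity.
quantities = [1, 2, 5]
totals = []
for quantity in quantities:
    totals.append(unit_price * quantity)
print(totals)

# A dictionary: each value is stored under a key instead of a position.
heights_cm = {"Aki": 160, "Ben": 175, "Chie": 168}
for name, height in heights_cm.items():
    print(name, height)
```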

Next, data types. What is a data type, and why do I have to know about it? At first I did not think it was important at all, because I never had to think about data types until I got an error message caused by one. Basic calculations and lists mostly do not raise data type errors, but the more functions I learn, the more important it becomes to know what the data type is. Once you get past the basics of Python, you cannot avoid seeing tons of error messages, and many of them say the error occurred because the data type is not correct. 1+1 runs, but '1'+1 gives you an error message because the types do not match. I did not understand why the second one does not work; surely it should return 2! I got stuck and spent a lot of time troubleshooting this. Later I learned more about strings and found that '1' was not a number, it was text. After I learned about data types, I became able to troubleshoot error messages much faster, so it is very important to know about them. It affects your learning time and motivation significantly. You will also run into type errors in loops. I often got errors when using a loop together with the range function. Intuitively, range looks like a list of numbers, but Python treats it as its own type: you can loop over it directly, but when I needed real list behavior I had to convert it with list(). In the end, data types are important to know because they help you solve error messages, and solving error messages matters because it directly affects your learning performance.
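The string-plus-number error can be reproduced in a few lines. This is only a sketch: the variable names are made up, and the range part shows one way a range differs from a list (changing an item), which may not be the exact error described above.

```python
# Adding a string and an integer raises a TypeError.
message = None
try:
    result = "1" + 1
except TypeError as error:
    message = str(error)
print("Error:", message)

# type() shows what each value really is.
print(type("1"), type(1))

# Converting explicitly makes the intent clear.
fixed = int("1") + 1      # number + number: 2
joined = "1" + str(1)     # text + text: "11"

# A range can be looped over directly...
numbers = range(3)
for number in numbers:
    print(number)

# ...but it is not a list, so convert it before changing its items.
as_list = list(numbers)
as_list[0] = 10
print(as_list)
```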

Then control flow, which is mostly about for loops and if statements. Finally it starts to sound like programming. This is the reason to use Python for calculation instead of an ordinary calculator. A for loop repeats the same process. When you need to calculate BMI for 100 patients, you would punch a calculator 100 times, but you only need to run it once if you use a for loop. In many cases a for loop is used with a list, and I prefer to use a list comprehension to make the list. You can create a list in one line if you are able to combine if, for, and list comprehension well. When I first started studying list comprehensions, I did not like them because they did not look intuitive, and any list made with a list comprehension can also be written with a for loop. The for loop looked easier and more flexible, so I liked using it to create lists. The moment I realized how useful list comprehensions are was when I played around on Codewars. It is a website where users share their code for solving quizzes. People there wrote one line to solve a quiz that took me 10 lines. That was impressive, and I realized how inefficient my code was. In terms of writing time, run speed, and simplicity of the code, I now prefer list comprehensions.
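The BMI example can be sketched both ways. The patient data is made up, and the cutoff of 25 in the last line is just an arbitrary number to show an if inside a comprehension.

```python
# Made-up (weight in kg, height in m) pairs; imagine 100 of these.
patients = [(60.0, 1.65), (82.5, 1.80), (50.0, 1.55)]

# For loop version: build the list step by step.
bmi_loop = []
for weight_kg, height_m in patients:
    bmi_loop.append(round(weight_kg / height_m ** 2, 1))

# List comprehension version: the same result in one line.
bmi_comp = [round(weight_kg / height_m ** 2, 1) for weight_kg, height_m in patients]

# Adding an if keeps only the values that pass a condition.
high_bmi = [bmi for bmi in bmi_comp if bmi >= 25]

print(bmi_loop, bmi_comp, high_bmi)
```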

The techniques above are used in any field that uses Python, such as web application development. Any Python user knows them. Now, finally, Python for data science: Pandas. Pandas can manage large data sets like Excel or SQL. Pandas can hold the information of 1000 students, with their name, age, height, weight, sex, and birthday, and then it can sort, delete, or add information, or calculate the average, the sum, and many other statistics. Very quickly. What if you are asked for the average height of 1000 students? With a normal calculator, would you type in 1000 heights? It is too much work, and you would probably make a mistake. If you have Excel, fine, Excel also makes it easy to get the average of one column, but Pandas gives you the answer immediately. One reason I like Pandas more than Excel is that I will not mess up the data accidentally. In Excel I might delete data by mistake, for example if my elbow somehow touches the delete key, but that will not happen in Pandas.
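Here is a rough sketch of the student example with four made-up students instead of 1000. It assumes pandas is installed; the column names are my own choice.

```python
import pandas as pd

# A tiny made-up table; a real one could have 1000 rows.
students = pd.DataFrame({
    "name": ["Aki", "Ben", "Chie", "Dan"],
    "age": [20, 22, 21, 23],
    "height_cm": [160.0, 175.0, 168.0, 181.0],
})

# Average of one column.
average_height = students["height_cm"].mean()

# Sorting and filtering return new tables; the original stays as it was.
tallest_first = students.sort_values("height_cm", ascending=False)
older_than_21 = students[students["age"] > 21]

print(average_height)
print(tallest_first)
print(older_than_21)
```

Because sort_values and the filter return new tables by default, the original students table is not changed unless I assign the result back to it.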

Then regression analysis! This is the most exciting part for me because I can finally get into machine learning. Let's think about a simple example. The more water you use, the more you pay. That's obvious. If you use twice as much water as last month, you can predict that you will pay twice last month's fee. Now a more complicated one. You move to a new city, you are looking for a new home, and you want to know the estimated rent. There are many parameters when you choose a home. You will check the size of the room and whether there is furniture, an air conditioner, a garage, a kitchen, a bathroom, the location, and so on. Rent varies by area even if the room is of the same quality. You can use a regression model to estimate the price of the room you want. Then you will be less likely to be cheated.
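The water example can be sketched as a simple linear regression with one parameter. This is a minimal version written by hand with ordinary least squares, not necessarily the library or method used in my course, and the usage and bill numbers are made up.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up monthly water usage and the bill paid for it.
usage = [10, 20, 30, 40]
bill = [21, 39, 61, 79]

slope, intercept = fit_line(usage, bill)

# Predict the bill for a month with 50 units of usage.
predicted_bill = slope * 50 + intercept
print(slope, intercept, predicted_bill)
```

With more parameters, such as room size, furniture, and area for rent, the idea is the same but the model has one coefficient per parameter.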

Modeling is pretty difficult for me. I need to combine the techniques described above. I sort and clean the data with Pandas, and I need for loops, lists, and modulo to filter information. When I build a regression model, there is a lot of noise, and I do not know which parameters help or hurt. Fortunately there are convenient evaluation methods, so I build one model, evaluate it, add or remove parameters, and repeat until I find the best-fitting model. I am still far from developing Face ID.
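The build, evaluate, and compare loop might be sketched like this. I use R squared as the evaluation score here as an assumption, since it is one common choice; the rent data and the two candidate parameters (size_m2 and floor) are made up.

```python
def r_squared(xs, ys):
    """Fit a least-squares line to (xs, ys) and return its R squared score."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up rents and two candidate parameters.
rent = [520, 640, 810, 940, 1100]
candidates = {
    "size_m2": [20, 30, 40, 50, 60],
    "floor": [3, 1, 4, 1, 5],
}

# Build one model per parameter, evaluate each, and keep the best score.
scores = {name: r_squared(values, rent) for name, values in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

A score close to 1 means the line explains most of the variation in rent, so in this toy data size is a much better parameter than floor.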
