PART 1 : DA

Data Analytics- Part 1

Disclaimer:

  • This document contains unedited notes and has not been formally proofread.
  • The information provided in this document is intended to provide a basic understanding of certain technologies.
  • Please exercise caution when visiting or downloading from websites mentioned in this document and verify the safety of the website and software.
  • Some websites and software may be flagged as malware by antivirus programs.
  • The document is not intended to be a comprehensive guide and should not be relied upon as the sole source of information.
  • The document is not a substitute for professional advice or expert analysis and should not be used as such.
  • The document does not constitute an endorsement or recommendation of any particular technology, product, or service.
  • The reader assumes all responsibility for their use of the information contained in this document and any consequences that may arise.
  • The author disclaims any liability for any damages or losses that may result from the use of this document or the information contained therein.
  • The author and publisher reserve the right to update or change the information contained in this document at any time without prior notice.

*********************************************************************************

Linear Regression

Statistics is about finding the most likely event.

35. Two Types of problems

1.     Regression
2.     Classification
Regression: the dependent variable is continuous and is influenced by the independent variables.
Classification: the dependent variable is a class (Yes/No, Zero/One, etc.).

36. Conditional Mean:  

Example: taking the mean height separately for boys and for girls instead of one mean for everyone.
·      Conditional models have a lower prediction error than the overall mean
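
A minimal sketch of the conditional-mean idea in R. The heights and the girl/boy labels are made-up values for illustration only; the point is that predicting each child by the group mean leaves a smaller squared error than predicting everyone by the overall mean.

```r
# Hypothetical heights in cm; the values are made up for illustration.
height <- c(150, 152, 155, 149, 160, 163, 158, 165)
group  <- c("girl", "girl", "girl", "girl", "boy", "boy", "boy", "boy")

overall_mean <- mean(height)        # one mean for everyone
group_mean   <- ave(height, group)  # each child's own group mean

# Sum of squared prediction errors under each approach.
sse_overall     <- sum((height - overall_mean)^2)
sse_conditional <- sum((height - group_mean)^2)

c(overall = sse_overall, conditional = sse_conditional)
```

With these made-up numbers the conditional error is much smaller because the two groups have clearly different means.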

37. Conditional predictive analysis

1.     Dependent variables
2.     Independent variables

38. Simple linear regression

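            Linear means a straight line
Y = m X + c
Y = dependent variable (the value being predicted)
X = independent variable
m = slope
c = constant (the intercept)

With an error term, for the first data point:
y1 = a + b x1 + error1
Actual y = structure (a + b x1) + error1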
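
The split of each actual value into a structural part and an error can be checked in R. This is only a sketch; it borrows the small x and y vectors from the R example further down in these notes, and the names structure_part and error_part are made up here.

```r
# Same toy data as the lm() example later in these notes.
x <- c(-4, -2, 2, 4, 10)
y <- c(-2, 4, 2, 6, 8)

fit <- lm(y ~ x)
a <- coef(fit)[["(Intercept)"]]  # intercept (the constant c)
b <- coef(fit)[["x"]]            # slope (m)

structure_part <- a + b * x          # the straight-line part
error_part     <- y - structure_part # what the line does not explain

# Actual y = structure + error, point by point.
cbind(actual = y, structure = structure_part, error = error_part)
```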

39. Correlation (r): always lies between -1 and +1
·      r = 0.3 – low correlation
·      r = 0.6 – medium correlation
·      r = 0.8 – high correlation
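
As a quick check, the correlation can be computed with cor(). The sketch below reuses the toy x and y from the R example later in these notes; for a simple regression the square of r equals the "Multiple R-squared" reported there.

```r
x <- c(-4, -2, 2, 4, 10)
y <- c(-2, 4, 2, 6, 8)

r <- cor(x, y)  # Pearson correlation, always between -1 and +1
r               # about 0.854, a high correlation
r^2             # about 0.7297, the Multiple R-squared of lm(y ~ x)
```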

40. Intercept: - the point where the line cuts the Y axis (the value of Y when X = 0)


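41. Std Error: -

 The standard deviation of an estimate (such as the slope or the intercept): how much the estimate would vary from sample to sample.
·      It is estimated from the scatter of the points around the fitted line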

42. How far is far?

·      Here the P-value plays the important role

Diagram


·      A small P-value indicates that the estimate is far from zero.
·      P-value is small = confidence that the estimate differs from zero is high = Std Error is small relative to the estimate

43. F-statistic:  

            Collective information from all the variables, giving one overall model P-value.
If the overall (collective) model is good, it signifies that at least one coefficient is different from zero.
If all the coefficients are equal to zero, then there is no information in the model.

E.g. code in RStudio
> # Predict children’s height # 7% of the variance is explained by father’s height
> x<- c(-4,-2,2,4,10)
> y<- c(-2,4,2,6,8)
> summary(lm(y~x))

Output: -
Call:
lm(formula = y ~ x)

Residuals:
   1    2    3    4    5
-2.0  2.8 -1.6  1.2 -0.4

Coefficients:
            Estimate Std. Error t value Pr(>|t|) 
(Intercept)   2.4000     1.1155   2.151   0.1205 
x             0.6000     0.2108   2.846   0.0653 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.309 on 3 degrees of freedom
Multiple R-squared:  0.7297,  Adjusted R-squared:  0.6396

F-statistic:   8.1 on 1 and 3 DF,  p-value: 0.06532
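
The numbers in this output are linked, and the links can be reproduced by hand. The sketch below assumes the same x and y as above: the t value is Estimate / Std. Error, the P-value comes from the t distribution on the residual degrees of freedom, and with a single x the F-statistic is the square of the t value.

```r
x <- c(-4, -2, 2, 4, 10)
y <- c(-2, 4, 2, 6, 8)
fit <- lm(y ~ x)

coefs <- summary(fit)$coefficients
est <- coefs["x", "Estimate"]    # 0.6
se  <- coefs["x", "Std. Error"]  # about 0.2108

# How far is far? The estimate measured in units of its Std. Error.
t_value <- est / se
# Two-sided P-value on 3 residual degrees of freedom.
p_value <- 2 * pt(-abs(t_value), df = df.residual(fit))
# Overall F-statistic; with one x it equals t_value^2.
f_value <- summary(fit)$fstatistic[["value"]]

c(t = t_value, p = p_value, F = f_value)
```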

44.  Adjusted R-Square Value
When more variables (x) are added to the model, the “Multiple R-squared” keeps increasing and no longer provides the right information.
For that case the “Adjusted R-squared” value is used, since it penalizes the number of variables.
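
A short sketch of how the adjustment works, using the standard formula Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of points and p the number of x variables. The data are the toy x and y from the example above.

```r
x <- c(-4, -2, 2, 4, 10)
y <- c(-2, 4, 2, 6, 8)
fit <- lm(y ~ x)

r2 <- summary(fit)$r.squared  # Multiple R-squared, about 0.7297
n  <- length(y)               # 5 data points
p  <- 1                       # one x variable

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2                        # about 0.6396, as in the output above
```
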

45. Prediction equation

Predict the height of the children, given father’s and mother’s height.
E.g. Y = 34.65 + 0.42 * father_height + 5.17 * (male)

                        Estimate
Intercept               34.46.113
Father                  0.42
Factor (Gender) Male    5.17

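Here mother and father are the “X” variables in the stated problem, but only the father’s height and the Male level of the factor (gender) are displayed.

For a factor, every level except the baseline is displayed (here Male); the left-out baseline level (Female) is represented in the intercept.
An “X” that is left out of the model altogether (here the mother’s height) has no coefficient of its own, so its average effect also ends up in the intercept.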
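
A sketch of both ideas in R. The function predict_height is a made-up name that simply applies the equation quoted above, and the father height of 70 is an arbitrary input. The small data frame in the second half is hypothetical and only shows that lm() reports a coefficient for Male while the baseline level Female stays in the intercept.

```r
# Apply the prediction equation from the notes (male = 1 for boys, 0 for girls).
predict_height <- function(father_height, male) {
  34.65 + 0.42 * father_height + 5.17 * male
}
predict_height(70, male = 1)  # 69.22
predict_height(70, male = 0)  # 64.05

# Hypothetical data, only to show how a factor appears in the coefficients.
d <- data.frame(
  father = c(65, 68, 70, 72, 66, 69, 71, 74),
  gender = factor(c("Female", "Male", "Female", "Male",
                    "Female", "Male", "Female", "Male")),
  child  = c(62, 69, 64, 71, 63, 70, 65, 72)
)
fit <- lm(child ~ father + gender, data = d)
names(coef(fit))  # "(Intercept)" "father" "genderMale"
```

Female does not get its own row because it is the first factor level, which R uses as the baseline.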

45. Error Sum of Squares

Residual Sum of Squares SSres
 Y = X b + e
 SSres = e'e = sum of the squared residuals (actual - predicted)^2
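
For the toy data used in the R example above, SSres can be computed directly from the residuals. This is a sketch; the matrix form e'e gives the same number as the plain sum of squares.

```r
x <- c(-4, -2, 2, 4, 10)
y <- c(-2, 4, 2, 6, 8)
fit <- lm(y ~ x)

e <- resid(fit)                   # residuals: -2.0 2.8 -1.6 1.2 -0.4
ss_res     <- sum(e^2)            # plain sum of squared residuals
ss_res_mat <- drop(t(e) %*% e)    # the same value written as e'e

ss_res                            # 16
sqrt(ss_res / df.residual(fit))   # about 2.309, the residual standard error
```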

46. Matrix multiplication


Hat matrix (refer to internet content for more detail)

The model in matrix form, with dimensions:

y      =  X      b      +  e
n×1       n×p    p×1       n×1

The fitted or predicted values are y_hat = X b_hat, where

b_hat = (X'X)^-1 X'y
y_hat = X (X'X)^-1 X'y
H     = X (X'X)^-1 X'     (the Hat matrix, n×n)

y_hat = H y

X b_hat is the predicted vector.

Error
= y - y_hat
= I y - H y
= (I - H) y

Standardized residuals have sqrt(1 - h_ii) in the denominator, where h_ii is the i-th diagonal element of H.
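
The matrix formulas can be checked numerically. The sketch below builds the hat matrix by hand for the toy x and y used earlier and compares it with what lm() reports; the column of ones in X stands for the intercept.

```r
x <- c(-4, -2, 2, 4, 10)
y <- c(-2, 4, 2, 6, 8)

X <- cbind(1, x)                          # n x p design matrix (p = 2)
XtX_inv <- solve(t(X) %*% X)              # (X'X)^-1

b_hat <- XtX_inv %*% t(X) %*% y           # coefficients: 2.4 and 0.6
H     <- X %*% XtX_inv %*% t(X)           # hat matrix, n x n
y_hat <- H %*% y                          # fitted values
e     <- (diag(length(y)) - H) %*% y      # residuals (I - H) y

# Standardized residuals: e_i / (s * sqrt(1 - h_ii)).
s <- sqrt(sum(e^2) / (length(y) - ncol(X)))
std_resid <- e / (s * sqrt(1 - diag(H)))

# Compare with the built-in results.
fit <- lm(y ~ x)
cbind(by_hand = diag(H), built_in = hatvalues(fit))
```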

Diagram






47. Logistic regression


Logistic regression is a classification method.
The dependent variable is a class such as Yes/No, Zero/One, etc.

Here are the four cells of the confusion matrix:

                  Predicted “+ve”      Predicted “-ve”
Actual “+ve”      True “+ve” %         False “-ve” %
Actual “-ve”      False “+ve” %        True “-ve” %
Diagram


The confusion matrix is used to determine the goodness of the model.

How do we determine whether the model built is a good one?
McFadden’s goodness-of-fit measure (a pseudo R-squared) is used for this.
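
A rough sketch of these steps in R. The x and y values are hypothetical, and the 0.5 cut-off is an assumption, not something fixed by the notes. McFadden's measure is computed as 1 - logLik(model) / logLik(null model), where the null model has only an intercept.

```r
# Hypothetical data: y is the class (0/1), x a single predictor.
x <- 1:10
y <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)

fit  <- glm(y ~ x, family = binomial)  # logistic regression
null <- glm(y ~ 1, family = binomial)  # intercept-only model

# Predicted probabilities turned into classes with an assumed 0.5 cut-off.
prob <- predict(fit, type = "response")
pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))

# Confusion matrix: rows are actual classes, columns are predicted classes.
confusion <- table(Actual = factor(y, levels = c(0, 1)), Predicted = pred)
confusion

# McFadden's pseudo R-squared: closer to 1 means a better fit than the null model.
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
mcfadden
```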

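47.1 ROC – Receiver Operating Characteristic curve

ROC = True “+ve” %  vs   False “+ve” %, plotted for every possible cut-off

It shows the trade-off: for example, are we willing to allow a 10% False “+ve” rate in order to attain an 80% True “+ve” rate?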
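
The points of an ROC curve can be computed by trying several cut-offs. In this sketch the labels, scores and cut-offs are all hypothetical; each row of the result gives the True "+ve" rate and the False "+ve" rate for one cut-off.

```r
# Hypothetical true classes and predicted probabilities.
labels <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
scores <- c(0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.65, 0.70, 0.80, 0.90)

cutoffs <- c(0, 0.25, 0.5, 0.75, 1)

roc <- t(sapply(cutoffs, function(cut) {
  pred <- scores >= cut  # predicted "+ve" at this cut-off
  c(cutoff = cut,
    tpr = sum(pred & labels == 1) / sum(labels == 1),  # True "+ve" rate
    fpr = sum(pred & labels == 0) / sum(labels == 0))  # False "+ve" rate
}))
roc

# Plotting fpr against tpr gives the ROC curve.
plot(roc[, "fpr"], roc[, "tpr"], type = "b",
     xlab = "False +ve rate", ylab = "True +ve rate")
```

Lowering the cut-off raises the True "+ve" rate but also lets in more False "+ve", which is the trade-off the curve makes visible.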

Diagram


47.2 F1 Measure

F1 Measure = harmonic mean of Precision (P) and Recall (R) = 2PR / (P + R)

Precision = (True “+ve”) / (True “+ve” + False “+ve”)
Recall = (True “+ve”) / (False “-ve” + True “+ve”)
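
A small worked example in R. The counts tp, fp and fn are made-up numbers; F1 is taken as the harmonic mean 2PR / (P + R), the form given in the review section of these notes.

```r
# Hypothetical counts from a confusion matrix.
tp <- 80  # True "+ve"
fp <- 20  # False "+ve"
fn <- 10  # False "-ve"

precision <- tp / (tp + fp)  # 0.8
recall    <- tp / (fn + tp)  # about 0.889
f1        <- 2 * precision * recall / (precision + recall)

c(precision = precision, recall = recall, f1 = f1)  # f1 is about 0.842
```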


Examples
·      Credit card fraud
·      Health problem detection
·      Insurance Buying


48. Review of Regression and Logistic regression


1.     With conditioning: the prediction becomes closer to the actual value.
2.     Why prediction works: the variance around the conditional expectation is smaller than the variance around the unconditional expectation.
3.     What is prediction: it is the conditional expectation.
4.     Residuals: the difference between actual and predicted values (Actual - Predicted).
5.     Residual analysis: 4 types of graphs - Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage
a.     par(mfrow = c(2, 2)); plot(model)
6.     What are leverage points: a single point (typically with an extreme X value) that can pull the intercept or slope away from what the true points give.
7.     True points: the points of the whole data set.
8.     Prediction equation: child height = 39.110 + (0.399 * father’s height)
9.     “P”-value: when P is small, the absolute “t” value is big and the estimate is significant.
a.     A small P-value tells us that the t-value (Estimate / Std. Error), and hence the estimate, is significantly different from zero.
10.  Std. Error: the standard deviation of the estimate.
11.  R = “-1”: perfectly negatively correlated; R = “+1”: perfectly positively correlated.
a.     R = 0: there is no linear relationship; the variables do not increase together in a linear form.
12.  Adjusted R2: as the number of variables increases, the normal R2 increases too, giving a wrong picture, whereas Adjusted R2 indicates whether the increase is significant or not.
13.  Box-Cox: the transformation brings non-normal data closer to normal data.
14.  Hat matrix: is used to calculate
a.     Cook’s distance: cooks.distance()
b.     Hat values: hatvalues()
c.     Covariance ratio: covratio()
15.  5-fold cross-validation is a must for complex models
a.     Training data = 80%; Test = 20%
b.     The data is split into 5 distinct sets (folds)
c.     Each set in turn is held out as the test data, and the model built on the other four sets is verified against it
16.  Confusion matrix: used to judge a classification model. The two kinds of model are fitted differently:
a.     Logistic: maximizing the likelihood of occurrence
b.     Regression: minimizing the sum of squares
17.  ROC: shows where the cut-off has to be made, i.e. how much False “+ve” to allow in order to attain the highest True “+ve”
18.  F1 Measure = harmonic mean of Precision (P) and Recall (R) = 2PR / (P + R)
19.  Precision: among the selected, how many are targets (True)
20.  Recall: among the targets, how many are selected
21.  Specificity: True “-ve” rate = 1 - False “+ve” rate
22.  Sensitivity: True “+ve” rate (the same as Recall)
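
Two of the review points, 5-fold cross-validation and the measures calculated from the hat matrix, can be sketched in base R. Everything about the data here is an assumption: the heights are simulated around the prediction equation quoted in the review, and the seed, sample size and noise level are arbitrary.

```r
set.seed(1)  # hypothetical seed, only to make the sketch repeatable

# Simulated data (an assumption), built around the review's prediction equation.
n      <- 50
father <- rnorm(n, mean = 68, sd = 3)
child  <- 39.110 + 0.399 * father + rnorm(n, sd = 2)
d      <- data.frame(father, child)

# Split the rows into 5 distinct folds of 10 rows each.
folds <- sample(rep(1:5, length.out = n))

# Each fold is held out once (20%); the model is fitted on the other four (80%).
rmse <- sapply(1:5, function(k) {
  fit_k <- lm(child ~ father, data = d[folds != k, ])
  held  <- d[folds == k, ]
  sqrt(mean((held$child - predict(fit_k, newdata = held))^2))
})
mean(rmse)  # average prediction error on held-out data

# Measures calculated from the hat matrix, on the full-data fit.
fit <- lm(child ~ father, data = d)
influence_table <- data.frame(
  hat   = hatvalues(fit),
  cooks = cooks.distance(fit),
  covr  = covratio(fit)
)
head(influence_table)
```

Points with a large hat value or Cook's distance are the ones worth a closer look in the Residuals vs Leverage plot.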

 ************************************************************************************************************
