---
title: Statistics
updated: 2022-04-02 15:10:58Z
created: 2021-05-04 14:58:11Z
---

# Statistics

## Data Types

- **Categorical**
  - **Nominal Variables**
    Labels have no intrinsic order:
    - Country of birth (Argentina, England, Germany)
    - Postcode
    - Vehicle make (Citroen, Peugeot, ...)
  - **Ordinal Variables**
    Categories that can be meaningfully ordered:
    - Student's grade in an exam (A, B, C or Fail)
    - Days of the week (Monday = 1 and Sunday = 7)
- **Numerical**
  - **Discrete**
    - How many cards are in a game?
    - Integers
  - **Continuous**
    - Height of a room
    - Floating point numbers

What are proportions?

A proportion is an aggregation of nominal data into a numerical figure, e.g. the percentage of observations that fall in a given category.

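As a minimal sketch of how a proportion turns nominal labels into numbers: the helper name `proportions` and the sample list of vehicle makes below are made up for illustration.

```python
from collections import Counter


def proportions(labels):
    """Return the share of observations per category (shares sum to 1)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical nominal data: one vehicle make per observation.
vehicle_makes = ["Citroen", "Peugeot", "Citroen", "Citroen", "Peugeot"]
shares = proportions(vehicle_makes)

for make, share in shares.items():
    print(f"{make}: {share:.0%}")
```

The labels themselves stay unordered; only the aggregated shares are numerical.
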
## Mixed Variables

- Observations show either numbers or categories among their values
  - Number of credit accounts (1-100, U, T, M), where U = unknown, T = unverified, M = unmatched
- Observations show both numbers and categories in their values
  - Cabin (Titanic) (A15, B18, ...)

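One common way to handle mixed variables is to split each value into a numerical part and a categorical part. The sketch below assumes the two shapes described above. The function names `split_either` and `split_both`, and the letters-then-digits pattern used for cabins, are illustrative assumptions.

```python
import re


def split_either(value):
    """Split a value that is either a number or a category code.

    Returns (number, category); exactly one of the two is None.
    """
    text = str(value).strip()
    if text.isdigit():
        return int(text), None
    return None, text


def split_both(value):
    """Split a value such as 'A15' into (category, number)."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", str(value).strip())
    if match is None:
        # Value does not follow the assumed letters-then-digits pattern.
        return None, None
    return match.group(1), int(match.group(2))


# Hypothetical observations for the two kinds of mixed variable.
credit_accounts = ["3", "U", "17", "M"]
cabins = ["A15", "B18"]

print([split_either(v) for v in credit_accounts])
print([split_both(v) for v in cabins])
```

Each original column then becomes two columns, one numerical and one categorical.
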
## Distributions

![48751b057b60e03ec51f64e3235fa1b3.png](../../_resources/48751b057b60e03ec51f64e3235fa1b3.png)

A value picked near the middle of the x-axis has a higher probability than the rarer values towards the edges.
Bell curve of the normal distribution

![be8b17237548f72ecd8013f80df036dc.png](../../_resources/be8b17237548f72ecd8013f80df036dc.png)

Bimodal distribution

![b7f8b2f785a9637ea5a22abe2877bca5.png](../../_resources/b7f8b2f785a9637ea5a22abe2877bca5.png)

Skewed distribution

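Sample Distribution

![34262aff59c5f5dd9a413b2b3d74629a.png](../../_resources/34262aff59c5f5dd9a413b2b3d74629a.png)

The sample mean $\overline{X}$ changes from sample to sample, so it has a variance of its own. For $n$ independent observations with variance $\sigma^2$:

$$
\operatorname{Var}(\overline{X}) = \frac{\sigma^2}{n}
$$
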
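The idea that a sample statistic has its own distribution can be sketched with a small simulation. Everything here is an assumption for illustration: the exponential population, the sample size of 50, and the helper name `sample_means`.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples samples (with replacement) and return their means."""
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical skewed population: 10,000 exponential values.
population = [rng.expovariate(1.0) for _ in range(10_000)]

means = sample_means(population, sample_size=50, n_samples=2_000, rng=rng)

pop_var = statistics.pvariance(population)
print("population mean:      ", statistics.fmean(population))
print("mean of sample means: ", statistics.fmean(means))
print("variance of the means:", statistics.pvariance(means))
print("population var / n:   ", pop_var / 50)
```

Even though the population is skewed, the sample means cluster around the population mean, and their variance is close to the population variance divided by the sample size.
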
## Sampling and Estimation

E.g. some count observed in the sample (such as the number of successes) divided by the sample size gives an estimate: 10 / 3 ≈ 3.3333

$$
\Theta = \text{estimate, with some variance around it, that makes a good guess from the sample}
$$

![846c953521751f708bd680556dc9ae0b.png](../../_resources/846c953521751f708bd680556dc9ae0b.png)

So, given a sample, we are 95% confident that the true value lies in this interval around our sample estimate.
The less sure we are about $\Theta$, the wider the confidence interval, e.g. because $n$ is much smaller.

![d575f021de579d10e3855c763198e7bc.png](../../_resources/d575f021de579d10e3855c763198e7bc.png)

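To illustrate how the interval widens when $n$ shrinks, here is a minimal sketch using the normal-approximation (Wald) interval for a proportion. That method, the function name `proportion_ci`, and the counts are assumptions chosen for illustration; the note does not say which interval its figures use.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    p_hat = successes / n  # point estimate from the sample
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Same estimated proportion (0.4), very different sample sizes.
small = proportion_ci(successes=12, n=30)
large = proportion_ci(successes=400, n=1000)

print(f"n=30:   ({small[0]:.3f}, {small[1]:.3f})")
print(f"n=1000: ({large[0]:.3f}, {large[1]:.3f})")
```

Both intervals are centred on the same estimate, but the one from the small sample is several times wider.
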
## Hypothesis Testing

![981ea34418a595b422aab0b0df23f4b6.png](../../_resources/981ea34418a595b422aab0b0df23f4b6.png)

In hypothesis testing we never:
- prove anything
- accept the null hypothesis (we can only reject it or fail to reject it)

## P-values

Consider a null hypothesis.
A hypothesis test assesses whether our sample is extreme enough to reject the null.
The p-value then measures how extreme our sample is, assuming the null is true.

![6af1399567c87fcda04a6414efbe18bf.png](../../_resources/6af1399567c87fcda04a6414efbe18bf.png)

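As a rough sketch of "how extreme our sample is", the snippet below computes an exact two-sided binomial p-value for a coin-toss example. The scenario (16 heads in 20 tosses, null hypothesis of a fair coin) and the function name `binomial_p_value` are assumptions for illustration.

```python
from math import comb


def binomial_p_value(successes, n, p_null=0.5):
    """Exact two-sided p-value for a binomial count.

    Sums the probability, under the null, of every outcome that is
    no more likely than the observed one.
    """
    def pmf(k):
        return comb(n, k) * p_null**k * (1 - p_null) ** (n - k)

    observed = pmf(successes)
    # The small tolerance guards against floating point ties.
    return sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1 + 1e-9))


# Hypothetical sample: 16 heads in 20 tosses; null hypothesis: fair coin.
p_value = binomial_p_value(16, 20)
print(f"p-value: {p_value:.4f}")
```

A small p-value means a sample this extreme would be rare if the null were true; it does not prove the alternative.
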
## P-hacking

![ace369b638966681b9558c42e25dd0b4.png](../../_resources/ace369b638966681b9558c42e25dd0b4.png)