Summaries/AI/MachineLearning/Statistics.md

---
title: Statistics
updated: 2022-04-02 15:10:58Z
created: 2021-05-04 14:58:11Z
---

# Statistics

## Data Types

- **Categorical**
	- **Nominal Variables**
		Labels have no intrinsic order:
		- Country of birth (Argentina, England, Germany)
		- Postcode
		- Vehicle make (Citroen, Peugeot, ...)
	- **Ordinal Variables**
		Categories that can be meaningfully ordered:
		- Student's grade in an exam (A, B, C or Fail)
		- Days of the week (Monday = 1 and Sunday = 7)
- **Numerical**
	- **Discrete**
		Whole-number counts (integers), e.g. how many cards are in a game?
	- **Continuous**
		Measurements (floating point numbers), e.g. the height of a room
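
To make the nominal/ordinal distinction concrete, here is a minimal Python sketch. The rank mapping `GRADE_RANK` and the sample values are illustrative assumptions, not part of the notes.

```python
from collections import Counter

# Nominal: labels without intrinsic order. Counting them is meaningful,
# ranking them is not.
countries = ["Argentina", "England", "Germany", "England"]
country_counts = Counter(countries)

# Ordinal: labels with a meaningful order, made explicit by a rank mapping
# (GRADE_RANK is an assumed helper for this example).
GRADE_RANK = {"Fail": 0, "C": 1, "B": 2, "A": 3}
grades = ["B", "Fail", "A", "C"]
sorted_grades = sorted(grades, key=GRADE_RANK.get)

print(country_counts.most_common(1))  # [('England', 2)]
print(sorted_grades)                  # ['Fail', 'C', 'B', 'A']
```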

What are proportions?
A proportion is an aggregation of nominal data into a numerical figure, e.g. the percentage of observations that fall into each category.
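
A minimal sketch of turning nominal labels into proportions. The function name `proportions` and the sample list are made up for illustration.

```python
from collections import Counter

def proportions(labels):
    """Return the share of each label as a fraction of all observations."""
    total = len(labels)
    if total == 0:
        return {}  # no observations, so no proportions
    counts = Counter(labels)
    return {label: count / total for label, count in counts.items()}

vehicle_makes = ["Citroen", "Peugeot", "Citroen", "Citroen"]  # made-up sample
shares = proportions(vehicle_makes)
print(shares)  # {'Citroen': 0.75, 'Peugeot': 0.25}
```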

## Mixed Variables

- Observations show either numbers or categories among their values
  - Number of credit accounts (1-100, U, T, M), where U = unknown, T = unverified, M = unmatched
- Observations show both numbers and categories in their values
  - Cabin (Titanic) (A15, B18, ...)
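
Mixed variables are usually split into a numerical part and a categorical part before analysis. The sketch below shows one possible way to do that for the two examples above; the helper names `split_credit_accounts` and `split_cabin` are hypothetical, and the cabin pattern assumes labels of the form letters followed by digits.

```python
import re

def split_credit_accounts(value):
    """Split a value that is either a number or a code into (number, code).

    Exactly one of the two parts is None.
    """
    text = str(value).strip()
    if text.isdigit():
        return int(text), None
    return None, text

def split_cabin(cabin):
    """Split a cabin label such as 'A15' into its letter part and its number part."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", cabin.strip())
    if match is None:
        return None, None  # unexpected format: leave both parts missing
    return match.group(1), int(match.group(2))

print(split_credit_accounts("42"))  # (42, None)
print(split_credit_accounts("U"))   # (None, 'U')
print(split_cabin("B18"))           # ('B', 18)
```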


## Distributions
![48751b057b60e03ec51f64e3235fa1b3.png](../../_resources/48751b057b60e03ec51f64e3235fa1b3.png)
Values near the middle of the x-axis have a higher probability; values become rarer towards the edges.
Bell curve of the normal distribution

![be8b17237548f72ecd8013f80df036dc.png](../../_resources/be8b17237548f72ecd8013f80df036dc.png)
Bimodal distribution

![b7f8b2f785a9637ea5a22abe2877bca5.png](../../_resources/b7f8b2f785a9637ea5a22abe2877bca5.png)
Skewed distribution

Sample Distribution
![34262aff59c5f5dd9a413b2b3d74629a.png](../../_resources/34262aff59c5f5dd9a413b2b3d74629a.png)

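The sample mean varies from sample to sample. For n independent observations from a population with variance σ²:

$$
\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \operatorname{Var}(\overline{X}) = \frac{\sigma^2}{n}
$$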
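
The sampling distribution of the sample mean can also be explored by simulation. This is a minimal sketch, assuming a standard normal population, a sample size of 25 and 5,000 repetitions; all three are illustrative choices, and `sample_means` is a made-up helper name.

```python
import random
import statistics

def sample_means(n, repetitions, rng):
    """Draw `repetitions` samples of size `n` from N(0, 1) and return each sample's mean."""
    return [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n))
        for _ in range(repetitions)
    ]

rng = random.Random(0)  # fixed seed so the run is reproducible
means = sample_means(n=25, repetitions=5000, rng=rng)

# For independent draws with variance sigma^2, the sample mean has variance sigma^2 / n.
print(statistics.fmean(means))     # close to the population mean 0
print(statistics.variance(means))  # close to 1 / 25 = 0.04
```

Increasing `n` should make the spread of the sample means shrink, which is why larger samples give more stable estimates.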


## Sampling and Estimation
E.g. the number of successes divided by the number of samples gives an estimate: 10 / 3 ≈ 3.3333
$$
\Theta = \text{an estimate, with some variance around it, that makes a good guess from the sample}
$$
![846c953521751f708bd680556dc9ae0b.png](../../_resources/846c953521751f708bd680556dc9ae0b.png)
So, given a sample, we are 95% confident that the interval around our sample estimate contains the true value.
The less sure we are of this theta, the larger the confidence interval, e.g. because n is much smaller.
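
The effect of n on the interval width can be sketched with the normal-approximation (Wald) interval for a proportion. This is one common textbook method, not necessarily the one in the figure; `proportion_ci` is a hypothetical helper and the counts are made up.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

small = proportion_ci(3, 10)      # estimate 0.3 from a small sample
large = proportion_ci(300, 1000)  # same estimate 0.3 from a large sample
print(small)
print(large)
```

Both intervals are centred on the same estimate, but the one from the smaller sample is roughly ten times wider.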

![d575f021de579d10e3855c763198e7bc.png](../../_resources/d575f021de579d10e3855c763198e7bc.png)

## Hypothesis Testing

![981ea34418a595b422aab0b0df23f4b6.png](../../_resources/981ea34418a595b422aab0b0df23f4b6.png)
In hypothesis testing we never:
- prove anything
- accept the null hypothesis (we only reject it or fail to reject it)

## P-values
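Consider a null hypothesis:
A hypothesis test assesses whether our sample is extreme enough to reject the null.
The p-value then measures how extreme our sample is: the probability, assuming the null is true, of seeing a sample at least as extreme as ours.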
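
As a worked example, here is a minimal sketch of an exact one-sided binomial test. The scenario (9 heads in 10 coin flips, null hypothesis of a fair coin), the 0.05 threshold and the function name `binomial_p_value` are assumptions chosen for illustration.

```python
from math import comb

def binomial_p_value(k, n, p0=0.5):
    """One-sided p-value: P(X >= k) for X ~ Binomial(n, p0) under the null."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

p_value = binomial_p_value(k=9, n=10)
print(p_value)  # 11 / 1024, about 0.0107

alpha = 0.05  # significance threshold, chosen before looking at the data
decision = "reject the null" if p_value < alpha else "fail to reject the null"
print(decision)
```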

![6af1399567c87fcda04a6414efbe18bf.png](../../_resources/6af1399567c87fcda04a6414efbe18bf.png)


## P-hacking
![ace369b638966681b9558c42e25dd0b4.png](../../_resources/ace369b638966681b9558c42e25dd0b4.png)
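
One common form of p-hacking is running many tests and reporting only the ones that come out significant. Here is a minimal sketch of why that misleads, assuming 20 independent tests of true null hypotheses at a 0.05 threshold (both numbers are illustrative). Under a true null and a continuous test statistic, a p-value is uniformly distributed on [0, 1], so each one is simulated as a uniform random draw.

```python
import random

def any_false_positive(num_tests, alpha, rng):
    """Simulate `num_tests` true-null p-values; report whether any falls below alpha."""
    return any(rng.random() < alpha for _ in range(num_tests))

rng = random.Random(0)  # fixed seed so the run is reproducible
trials = 10_000
hits = sum(any_false_positive(20, 0.05, rng) for _ in range(trials))

simulated = hits / trials
theoretical = 1 - (1 - 0.05) ** 20  # about 0.64

print(simulated, theoretical)
```

With 20 tries there is roughly a 64% chance of at least one "significant" result even though no real effect exists.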