author | Julian T <julian@jtle.dk> | 2022-01-11 12:01:29 +0100 |
---|---|---|
committer | Julian T <julian@jtle.dk> | 2022-01-11 12:01:29 +0100 |
commit | 84ae9c0267a32488d5163e4fec325fb723703090 (patch) | |
tree | 7b5dc72eee1ce86c401500f0085f1cf4fa7ea427 /sem7 | |
parent | 97a67fae19de5c824d4baa7ead1965f97f28483e (diff) | |
Add exam notes for db and pp
Diffstat (limited to 'sem7')
-rw-r--r-- | sem7/db/eksamnen.md | 334 |
---|---|---|
-rw-r--r-- | sem7/pp/eksamnen.md | 202 |
-rw-r--r-- | sem7/pp/larger.hs | 21 |
-rw-r--r-- | sem7/pp/lec5.hs | 22 |
4 files changed, 558 insertions, 21 deletions
diff --git a/sem7/db/eksamnen.md b/sem7/db/eksamnen.md
new file mode 100644
index 0000000..2a6480b
--- /dev/null
+++ b/sem7/db/eksamnen.md
@@ -0,0 +1,334 @@

# TODO

 - Do exercise 2.3 in the distributed thing

# Nice Spatial SQL Commands

 - `ST_GeometryType(thing)` finds the type of `thing`.
 - `ST_Boundary(thing)` finds the boundary of `thing`. There is probably an interior version too.
 - `ST_DWithin(a, b, distance)` returns true if `a` and `b` are within `distance` of each other.
 - `ST_MakeBox2D(a, b)` creates a box from two points.

One can find nearest neighbours with:

```sql
select l.*,
  ST_Distance(l.geo, ST_Point(5, 5)) as dist
from landscape as l
order by dist
limit 5
```

# Introduction to Spatial Queries

**Range queries** are the simplest, as we just query inside a square.
This is very similar to something like

```sql
SELECT * WHERE salary BETWEEN 10 AND 20
```

only that range queries often look in two dimensions instead of one.

**Directional relationships** are things like north, west or south, and can be computed with `ST_Azimuth`.
However this does not make much sense in 3D.

## Topological Relationships

We first define boundary and interior.
The **boundary** is the set of pixels adjacent to a pixel outside the object,
while the **interior** is the space inside the object not occupied by the boundary.

If we consider two shapes, they are either *intersecting* or *disjoint*.
When they are disjoint we cannot really say more than that, but if they *intersect* they can be either

 - *equal* if their boundaries and interiors are equal,
 - *inside* if one interior is completely inside the other,
 - *contains*, which is the reverse of *inside*,
 - *adjacent* if their boundaries share pixels, but not their interiors,
 - *overlaps* if they share interior pixels, but each has pixels that are not in the other.

```sql
ST_Contains(b.geo, y.geo)
ST_Within(b.geo, y.geo)
ST_Touches(b.geo, y.geo)
ST_Equals(b.geo, y.geo)
ST_Overlaps(b.geo, y.geo)
```

## Points, Linestrings and Polygons

These are used to represent the different things in the 2d database.
This is nicely explained [here](https://help.arcgis.com/en/geodatabase/10.0/sdk/arcsde/concepts/geometry/shapes/types.htm).

Points
: A single point represented by coordinates. Its boundary is empty and its interior is just the point.

Linestring
: A line through a sequence of at least two points.
  It can be closed, where the begin point is also the end point.
  The line between the points is the interior, and the boundary is the outer points.

Polygon
: Defined by an exterior closed linestring and zero or more interior closed linestrings.
  The exterior ring must be defined counterclockwise and the interior rings clockwise, such that the right side of the linestring is always outside.
  See #c6fd for a nice figure.

# Spatial Database Indexing

We can specify the indexing method in SQL like this.

```sql
-- Creation of the table
create table landscape (
  geo_name varchar(20) primary key,
  geo geometry not null
);

-- With B+ tree
create index landscape_geo_name
on landscape (geo_name);

-- With R-tree via GIST, generalized search trees
create index landscape_geo
on landscape using gist (geo);
```

## Space Filling Curves

*Also referred to as SFCs.*

It is often more convenient to store points in a specific 1d ordering, as we can then use existing indexing methods.
However the spatial data we want to store and search is 2d.
Therefore we map this 2d data to 1d, making searching and inserting much easier.
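As a concrete sketch of such a mapping (my own illustration, not from the slides): the Z-order curve computes a 1d index by interleaving the bits of the two coordinates.

```haskell
import Data.Bits ((.|.), shiftL, testBit)
import Data.Word (Word32, Word64)

-- Z-order (Morton) index: interleave the bits of x and y,
-- x in the even bit positions and y in the odd ones.
zOrder :: Word32 -> Word32 -> Word64
zOrder x y = foldl step 0 [31, 30 .. 0]
  where
    step acc i =
      (acc `shiftL` 2)
        .|. (if testBit y i then 2 else 0)
        .|. (if testBit x i then 1 else 0)

-- zOrder 5 3 == 27; points with nearby coordinates tend to
-- get nearby curve indices, which is what an index wants.
```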
We do this by splitting the 2d space into smaller chunks, and then finding an ordering of these chunks.
The ordering of the chunks is determined by a space filling curve, of which there are many different kinds.

A good space filling curve is one where points that are close in 2d space are also close to each other on the curve.

However, because many points land in the same chunk, we cannot know for sure whether two objects overlap just because they are in the same chunk.
This is therefore only used for finding candidates.

Using space filling curves allows us to reuse the B+tree indexing structure, which is well supported by PostgreSQL.
However, they can also be combined with quadtrees or R-trees, which work in 2d by themselves.

## Trees

Here we introduce *minimum-bounding rectangles*, or MBRs, which are axis-aligned rectangles.
These are used to give an estimate of an object, which can lead to false positives (but not false negatives).

We then insert those bounding boxes into tree nodes, with a maximum of `M` boxes per node.
When more than `M` boxes would end up in a node, we split it in two.

Likewise, when deleting, we should combine nodes if they hold fewer than `m * M` boxes.
Here it is often hard to get the right clustering of boxes, as we want to minimize the dead space.

`m` is often `0.4`, and `M` is often set according to page size etc.

When searching we use a two-step filtering process:
first the fast intersection checks against the boxes, and then the expensive geometric intersection test.

We can refine R-trees into *R+-trees*, which require that there is no overlap between the node regions.
We then have to insert some objects twice, but this eliminates the need for multi-path searching.

# Introduction to Parallel Databases

Let's start with some important definitions.

Distributed Database
: Software system that permits management of the distributed database and makes the distribution transparent to the user.
Scale-up
: Adding more CPU cores and other hardware to a single machine.
Scale-out
: Adding more smaller machines and connecting them.
Throughput
: Time from query start to last row received.
Response Time
: Time from query start to first row received.

There is a nice slide about transparency in this lecture.
It does not really make sense to repeat it here.

## Performance

When performing a query we have many choices for where to do the work.
**Inter-query** parallelism is when many smaller queries are performed in parallel,
while **intra-query** parallelism is when a single large query is executed concurrently.

## Design of Distributed Database

Often done in a *3-tier* architecture, with a client layer, application layer, and database server layer.
Each layer can then have many different interconnected computers, with communication to adjacent layers via a network.

## Partitioning

**Horizontal partitioning** (also called *sharding*) is when we have different sets of rows of the same table on different sites.
We therefore preserve the whole schema at each site.
We can partition in different ways:

 - **Range partitioning**, by partitioning on some column such as timestamps.
 - **List partitioning**, such as using *round-robin*.
 - **Hash partitioning** (see the sketch below).

Here we should watch out for *data skew*, where the data is partitioned unevenly.
This could for example happen if we partition on a name column.
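A minimal sketch of hash partitioning (my own example; it assumes the `hashable` package for `hash`): each row goes to the site determined by the hash of its partitioning key.

```haskell
import Data.Hashable (hash)  -- from the hashable package

-- The site (numbered 0..nSites-1) responsible for a key.
siteFor :: Int -> String -> Int
siteFor nSites key = hash key `mod` nSites

-- Horizontal partitioning: every site keeps the full schema,
-- only the set of rows differs.
partitionRows :: Int -> [(String, row)] -> [[(String, row)]]
partitionRows nSites rows =
  [ [ p | p@(k, _) <- rows, siteFor nSites k == site ]
  | site <- [0 .. nSites - 1] ]
```

With a decent hash function this avoids the skew of partitioning directly on something like names, at the cost of losing range locality.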
**Vertical partitioning** is where each site has every row, but the partitioning happens on the columns.
This allows for better compression, but we must repeat the primary key columns across all sites for identification.

We must be able to say something about a partitioning.
**Completeness** states that all rows in the original should still be present in at least one of the partitions.
**Reconstruction** states that the original table can be recovered by applying some operator to all partitions.
And lastly **disjointness** states that the partitions should not share rows.

# Just Parallel Databases (what)

Scaling up, as explained before, is often much harder.
This is also called *vertical scaling* and is associated with *tight coupling*.

Scaling out is easier and often gives a more loosely coupled system, however this comes at the cost of complexity.
Scaling out is also called *horizontal scaling*.

**Shared memory** is an architecture where we have a shared memory between processors.
Here each CPU can see the whole memory space, and they all run the same OS instance.
Often used for *inter-query parallelism* or *load balancing*.
However, it is not very common for databases.

**Shared disk** is very common for databases in cloud computing, such as Amazon S3.
Here many processors, each with their own memory, share a set of disks via an interconnect.
Therefore disks and CPUs can be scaled independently.

**Shared nothing :-(** is very common with NoSQL.
Here many computers are connected with an interconnect and share nothing, just like *scale-out*.
This is harder to scale than shared-disk but can give better performance.

**Symmetric algorithm** refers to whether the processing of incoming tuples can be pipelined with a node's own tuples.
See page 368 for a better explanation.

## ACID and CAP

ACID stands for

 - Atomicity
 - Consistency
 - Isolation
 - Durability

While CAP stands for

 - Consistent
 - Always Available
 - Partition tolerant

With CAP we are only allowed to have two of the three *things*.

## Amdahl's Law

The more of our computation we can do in parallel, the better.
If we can do all of it (which is impossible) we get linear growth in performance as we add more units.
However, if we cannot, the speedup curve flattens out :-(.

There is a nice formula for this in the slides, but I can try to write it here.

```haskell
-- s is the sequential fraction, p the number of processing units
maxSpeedUp s p = 1 / (s + (1 - s) / p)
```
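For example (my own numbers, not from the slides): with a sequential fraction of 10%, eight processors give less than a five-fold speedup, and no number of processors can beat `1 / s`.

```haskell
speedupExample :: Double
speedupExample = maxSpeedUp 0.1 8
-- 1 / (0.1 + 0.9 / 8) = 1 / 0.2125 ≈ 4.71
-- The limit for infinitely many units is 1 / 0.1 = 10.
```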
We are interested in parallelising sorting and joining.
These operate on rows that are spread all over the place, and it is beneficial to do as much as possible on the device where the data is stored before sending it.

# Summary of Parallel Databases

This repeats some things in a nicer way.
I will just write what can be found in the slides, instead of repeating it.

 - Comparison between partitioning techniques, such as the advantages of hash etc.
 - Interquery parallelism

   Multiple queries in parallel. Easiest to implement on shared-memory.

 - Intraquery parallelism

   - **Intraoperation parallelism** parallelizes the execution of each operation in a query.
   - **Interoperation parallelism** executes different query operations in parallel.

# Intro Data Warehousing and Multidimensional Modeling

Reasons for data warehousing:

 - Often data quality is bad (different formats, inconsistent data)
 - Data is not permanent (deleted after 30 days, or overwritten by changes)

This makes it hard to answer more business-oriented queries.

If we consider the example of a single sale, then this sale would be a *fact* in the database, and the sales value would be called a *measure*.
This sales fact is then associated with a set of *dimensions* such as the product type, the store and the sale time.

Each of these dimensions is organized in a hierarchy, such that we can "zoom" in or out for more or less detail.
For example with a timestamp, we can be interested only in the week or year, but we can also zoom in to specific days.

Therefore we divide data into dimensions and facts.

Facts
: an important entity such as a sale. Often contains *measures*, which are things that can be aggregated, such as the sales price.
Dimensions
: describe a fact such that we can search for different things, for example where the sale happened, who bought it, etc.

We can then say that facts live in a *cube* with many dimensions (yeah, that does not make much sense, so *hypercube* is also sometimes used).
Often these cubes have around 4-12 dimensions, but only 2 or 3 can be shown due to our stupid brains.
We denote a cube with many empty cells a *sparse cube* and the opposite a *dense cube*.

The slides have a nice todo list of things to do when designing a data warehouse.

## Types of Facts

Event fact
: or a *transaction fact*, is a business event such as a sale.
Measureless fact
: without a numerical measure, so just something that happened for some combination of dimensions.
State fact
: captures the current status at some timestamp, such as inventory.

## Granularity

What does a single fact mean, i.e. the *level of detail*: "is a single fact the total sales in a day, or every single sale?"
We often let the granularity be a single business transaction, such as a sale, however we can aggregate this for scalability.

## More on Measures

A measure is something that a user might want to study or optimize for.
Each measure has two components, namely a *numerical value* and an *aggregation formula*.

There are different forms of measures:

 - **Additive measures** can be aggregated over all dimensions. A good example is the sales price.
 - **Semi-additive measures** can be aggregated over some dimensions, but not all.
   A good example is inventory, which cannot be aggregated over time (we would count stuff twice), but we can aggregate over store to get the total inventory.
 - **Non-additive measures** cannot be aggregated over anything, for example an average sales price.

## Implementation

Here the granularity is very important to consider: "what does a single row represent?"

It is not very nice to store all the dimension data with each fact, so instead we use *star schemas* and *snowflake schemas*,
which work with *fact tables* and *dimension tables*.

A **fact table** stores facts, and facts only, with references to the dimension values in the *dimension tables*.
Here we use a *surrogate key* to reference dimension values, which should be meaningless.

A **star schema** has one fact table, and then one de-normalized dimension table per dimension containing all levels (sketched below).
This eliminates a lot of the foreign key usage, making queries faster, however it hides the levels from the user and takes up more storage.

A **snowflake schema** stores the dimensions normalized, in that each level has its own dimension table.

Regarding *redundancy*, this is something we want to avoid, as it allows for inconsistent data and complicates updating.
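A minimal sketch of the star layout (my own illustration, all names hypothetical): the fact table holds only measures and surrogate keys, while each de-normalized dimension row carries all its levels.

```haskell
type SurrogateKey = Int

-- De-normalized dimension rows: all hierarchy levels in one table.
data ProductDim = ProductDim
  { productKey  :: SurrogateKey
  , productName :: String
  , category    :: String  -- higher level stored in the same row
  }

data StoreDim = StoreDim
  { storeKey  :: SurrogateKey
  , storeName :: String
  , city      :: String
  , country   :: String
  }

-- Fact table rows: measures plus surrogate keys only.
data SaleFact = SaleFact
  { saleProduct :: SurrogateKey  -- references ProductDim
  , saleStore   :: SurrogateKey  -- references StoreDim
  , salePrice   :: Double        -- additive measure
  }

-- Aggregating the additive measure, e.g. total sales for one store.
totalForStore :: SurrogateKey -> [SaleFact] -> Double
totalForStore sk facts = sum [ salePrice f | f <- facts, saleStore f == sk ]
```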
# Advanced Multidimensional Modeling

*warning: very very advanced*

diff --git a/sem7/pp/eksamnen.md b/sem7/pp/eksamnen.md
index 6641d2a..be694ae 100644
--- a/sem7/pp/eksamnen.md
+++ b/sem7/pp/eksamnen.md
@@ -7,6 +7,9 @@

## Everything Else

 - "The last condition ..." in logic part 2 at 1:40. Is that right?
 - Isn't there an error at 0:32 in the prolog lists video?

# Mixed

*Parameters* are in the declaration, and *arguments* are the things actually passed.

@@ -21,6 +24,19 @@

The arguments are placed in the body according to the parameters.

**Eta** says that a function that just passes its arguments on to another function can be substituted by that function.

## Some Haskell Things

```haskell
-- if
hej x = if x == 10 then 2 else 3

-- List comprehension
evens = [i | i <- [0..], i `mod` 2 == 0]

-- Here True and False are *term constructors* and Boolean is a *type constructor*
data Boolean = True | False
```

# Scheme lec. 1

Programming paradigm

@@ -184,3 +200,189 @@

Therefore normal order is the most powerful reduction.

`(delay expr)` is used to delay the evaluation of `expr`, by returning a promise.
The value of this can then be extracted with `(force promise)`.

# Haskell lec.5 Lazy Evaluation

Lazy evaluation is the default in Haskell.
Semantically it is quite simple, requiring only a small set of rules.

There is the beta rewrite rule from before.
And then there is the *application* rule, which states that for an application `e1 e2` we only evaluate `e1`.
This means that `e2` is never evaluated before the application is done.

# Haskell lec.7 Type Systems

## Simple Type System

Here expressions can have simple types such as Int, Bool, or function and tuple types.
This gives some different guarantees.

If `e` is *well typed* and `e` reduces to `e'`, then `e'` is also well typed.
This is called **subject reduction** and says something about *type preservation*.

The other guarantee is that if `e` is *well typed*, then `e` terminates, i.e. there is an `e'` which `e` reduces to and which cannot be reduced further.
This is **type safety**.

These guarantees also mean that there are many things one cannot express.
For example, what if one wants to write programs that do not terminate?

One also says that the simple type system has *slack*:
there are expressions that are safe but not well typed.
For example:

```
let f = x: x; in (f 0, f True)
```

This just gives `(0, True)`, but it is not well typed, because there is no simple type for `f`.

## Parametric Polymorphism

Here one introduces

 - **Type variables**, which can represent anything.
 - **Type schemes**, which say something about free and bound type variables.

If a type scheme has no quantifier for a particular type variable, that variable is *free*.

`Ftv(t)` denotes all free type variables in the type `t`.
Likewise, `Ftv(E)` for a type environment `E` is the union of `Ftv(t)` for all types `t` in `E`.

### Specialization

Here one makes a type from a type scheme; the type is then called a specialization of the scheme.

For example, `Int -> Int \leq \forall a. a -> a`.

### Generalization

This is the reverse, where one makes a type scheme from a type.
If a type has a free variable `a`, one can make a scheme containing `\forall a`.

This is done with the function `close(E, t)`, which adds quantifiers for all variables in `Ftv(t) \setminus Ftv(E)`.
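A small worked example (mine, using the definitions above): with `E = {x : a}` we have `Ftv(E) = {a}`, and for `t = a -> b` we have `Ftv(t) = {a, b}`, so only `b` gets quantified:

```latex
close(E,\; a \to b) = \forall b.\; a \to b
```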
### Principal Types

A type `t` is principal for an expression `e` and type environment `E` if

 - `t` is a type for `e` in `E`,
 - and every type `t1` for `e` in `E` is a specialization of `t`.

So the principal type can be called the most general type.

Here we would like a *type inference algorithm* that finds the principal type for an expression.

### Semantic Rules

Here some rules are introduced which, together with the old rules, can identify types for expressions.
These are `PROJ`, which performs specialization, and `GENERALIZATION`, which performs generalization.

# Haskell lec.8 More Typing

## Type Classes

With type classes we now say that the quantifiers in a scheme do not mean "`a` can be everything" but "`a` is part of type class `\Gamma`".
We use `\Gamma` for a collection of type classes.

This means that where one previously determined types from an environment mapping variables to types,
there is now also a `C` mapping type variables to type classes.

Okay, I think I will make these notes on paper.
A bit much math.

# Logic lec.9 Introduction and Datalog

Here I again take notes on paper.

# Logic lec.10 Datalog

The **Herbrand universe** is the set of constants that appear in the Datalog program.
These are the things which we can talk about.

The **Herbrand base** is then the set of all ground atoms that can be created.
This is basically all the things we can say about the **Herbrand universe**, giving a pretty large set.

## Interpretation

An *interpretation* is a set of atoms from the Herbrand base.
This could be any subset, and does not have to make sense.

We now look at how a clause can be true under an interpretation, or how an interpretation models a clause.
Facts are easy, as we just check whether the fact is contained in the interpretation.
A rule is only false if all the atoms in its body are true and the head is false, due to the implicative nature of rules.

## Model

We say that an interpretation is a model for a clause if every ground instance of the clause is true under the interpretation.
Such an interpretation we denote with M.

We denote by the *minimal model* of a program the smallest possible model for the program.
This minimal model can be found as the intersection of all models of the program.

We can find it by starting with the empty interpretation, and then repeatedly extending it with all possible conclusions from the program.
First the facts are added, and then we start adding conclusions from rules.
This process will always finish.

## Goal Directed Proof Search

Instead of computing the minimal model, we are often only interested in a specific conclusion or query.
This is done with *goal directed proof search*.

It works by constructing a proof tree, where we branch the tree every time we have to choose between two clauses.

## Negation in Datalog

In Datalog we can only negate atoms, and not rules.
Therefore we distinguish between negative and positive atoms.
A negative atom can then only appear in the body of a rule and not in the head.

Here we extend the idea of when an atom holds in an interpretation, such that `I |= not p(t)` holds if `p(t)` is not in `I`.

In Datalog we need some restrictions for this to make sense, as we can otherwise create programs with multiple incompatible models:

```prolog
p(X) :- not q(X).
q(X) :- not p(X).
```

Therefore we want to disallow circularity, which we do by defining layers.
We split the program into subprograms, where each subprogram only refers to things from previous layers.
We call this *stratification*.
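A minimal Haskell sketch (my own, restricted to ground atoms) of the bottom-up computation of the minimal model described under *Model* above: keep adding the heads of rules whose bodies already hold, until nothing changes.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

type Atom = String  -- a ground atom, e.g. "edge(a,b)"

-- A rule has a head and a body; a fact is a rule with an empty body.
data Rule = Rule { ruleHead :: Atom, ruleBody :: [Atom] }

-- One round of conclusions: add every head whose body is satisfied.
step :: [Rule] -> Set Atom -> Set Atom
step rules i =
  Set.union i (Set.fromList
    [ ruleHead r | r <- rules, all (`Set.member` i) (ruleBody r) ])

-- Iterate from the empty interpretation until a fixed point is reached.
minimalModel :: [Rule] -> Set Atom
minimalModel rules = go Set.empty
  where
    go i = let i' = step rules i
           in if i' == i then i else go i'

-- minimalModel [Rule "p(a)" [], Rule "q(a)" ["p(a)"]]
--   == Set.fromList ["p(a)","q(a)"]
```

This is exactly the *fixed-point* semantics mentioned below.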
## Some Words

 - *Model-theoretical* semantics refers to finding a solution with the Herbrand universe and base, interpretations and minimal models.
 - *Fixed-point* semantics is the bottom-up approach, where we compute the minimal model.
 - *Proof-theoretical* semantics is the top-down approach, with proof trees.

These can be shown to be equivalent.

The idea of a minimal model does not make much sense for Prolog, as the minimal models can be infinite.

# Logic lec.11 Prolog Instead of Datalog

One should remember that Prolog uses proof search instead of computing the minimal model.
With proof search and the existence of recursive clauses, the ordering of clauses matters.
This is because the order of clauses determines the order in which proof search tries things.

## Some Short Prolog Things

```prolog
% This mimics the idea of Haskell's composite data types.
nat(zero).
nat(succ(X)) :- nat(X).

% Some list notation:
% [1,2,3] is the same as [1 | [2 | [3 | []]]]
```

The above means that the Herbrand base is infinite.

## Proof Search

We check each clause in the order in which they appear.
If the head of a clause matches the current goal, we unify them, and the body of the clause becomes our new goals, which we then continue with.

```diff
diff --git a/sem7/pp/larger.hs b/sem7/pp/larger.hs
index 5618faf..2cf8e57 100644
--- a/sem7/pp/larger.hs
+++ b/sem7/pp/larger.hs
@@ -1,5 +1,26 @@
 import Data.Maybe
 
+-- From lec5 (note: needs Data.Set's empty, member and insert in scope)
+valid :: (Integral a, Ord a) => [(a, b)] -> Bool
+valid = recur empty
+  where recur _ [] = True
+        recur known (pair:rest) =
+          if key `member` known
+            then False
+            else recur (insert key known) rest
+          where key = fst pair
+
+-- TODO, the case where the key is not present is unhandled
+lookup' :: (Integral a, Ord a) => [(a, b)] -> a -> b
+lookup' (pair:rest) key = if fst pair == key
+                            then snd pair
+                            else lookup' rest key
+
+-- I don't see how findfun and lookup' are not the same
+findfun = lookup'
+
+
 data LTree = LLeaf String | LNode LTree LTree
 
 data LTreeRoot = LTreeRoot LTree Float
```

```diff
diff --git a/sem7/pp/lec5.hs b/sem7/pp/lec5.hs
index 3340346..37f6b12 100644
--- a/sem7/pp/lec5.hs
+++ b/sem7/pp/lec5.hs
@@ -39,7 +39,7 @@ ispalindrome x = x == reverse' x
 
 -- Okay assuming that a are all integers, i
--- would say that the type of cfrac :: (Real a, Integral b) -> a -> b -> [b]
+-- would say that the type of cfrac :: (Real a, Integral b) => a -> b -> [b]
 cfrac _ 0 = []
 cfrac x n =
   let intPart = truncate x in
@@ -69,23 +69,3 @@ flatten' (x:xs) = x ++ flatten' xs
 
 -- The function type is therefore flatten' :: [[a]] -> [a]
 -- When running :t on flatten' we get the same.
-
-
-valid :: (Integral a, Ord a) => [(a, b)] -> Bool
-valid = recur empty
-  where recur _ [] = True
-        recur known (pair:rest) =
-          if key `member` known
-            then False
-            else recur (insert key known) rest
-          where key = fst pair
-
-
--- TODO, case where key is not in is unhandled
-lookup' :: (Integral a, Ord a) => [(a, b)] -> a -> b
-lookup' (pair:rest) key = if fst pair == key
-                            then snd pair
-                            else lookup' rest key
-
--- I dont see how findfun and lookup are not the same
-findfun = lookup'
```