From 84ae9c0267a32488d5163e4fec325fb723703090 Mon Sep 17 00:00:00 2001
From: Julian T
Date: Tue, 11 Jan 2022 12:01:29 +0100
Subject: Add exam notes for db and pp

---
 sem7/db/eksamnen.md | 334 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 sem7/pp/eksamnen.md | 202 +++++++++++++++++++++++++++++++
 sem7/pp/larger.hs   |  21 ++++
 sem7/pp/lec5.hs     |  22 +---
 4 files changed, 558 insertions(+), 21 deletions(-)
 create mode 100644 sem7/db/eksamnen.md

diff --git a/sem7/db/eksamnen.md b/sem7/db/eksamnen.md
new file mode 100644
index 0000000..2a6480b
--- /dev/null
+++ b/sem7/db/eksamnen.md
@@ -0,0 +1,334 @@

# TODO

 - Do exercise 2.3 in the distributed databases material

# Nice Spatial SQL Commands

 - `ST_GeometryType(thing)` finds the type of `thing`.
 - `ST_Boundary(thing)` finds the boundary of `thing`. There is probably an interior version too.
 - `ST_DWithin(a, b, distance)` returns true if `a` and `b` are within `distance` of each other.
 - `ST_MakeBox2D(a, b)` creates a box from two points.

One can find the nearest neighbours (here the five nearest) with:

```sql
select l.*,
       ST_Distance(l.geo, ST_Point(5, 5)) as dist
from landscape as l
order by dist
limit 5
```

# Introduction to Spatial Queries

**Range queries** are the simplest, as we just query inside a rectangle.
This is very similar to something like

```sql
SELECT * WHERE salary BETWEEN 10 AND 20
```

only that range queries typically look in two dimensions instead of one.

**Directional relationships** are relations like north, west or south of, and can be computed with `ST_Azimuth`.
However, this does not make much sense in 3D.

## Topological Relationships

We first define boundary and interior.
The **boundary** consists of the pixels that are adjacent to a pixel outside the object,
while the **interior** is the space inside the object not occupied by the boundary.

If we consider two shapes, they are either *intersecting* or *disjoint*.
When they are disjoint we cannot really say more than that, but if they *intersect* they can either be

 - *equal* if their boundaries and interiors are equal,
 - *inside* if one interior is completely inside the other,
 - *contains*, which is the reverse of *inside*,
 - *adjacent* if their boundaries share pixels, but their interiors do not,
 - *overlapping* if they share interior pixels, but each has pixels which are not in the other.

```sql
ST_Contains(b.geo, y.geo)
ST_Within(b.geo, y.geo)
ST_Touches(b.geo, y.geo)
ST_Equals(b.geo, y.geo)
ST_Overlaps(b.geo, y.geo)
```

## Points, Linestrings and Polygons

These are used to represent the different kinds of objects in a 2D spatial database.
This is nicely explained [here](https://help.arcgis.com/en/geodatabase/10.0/sdk/arcsde/concepts/geometry/shapes/types.htm).

Point
: A single point represented by coordinates. Its boundary is empty and its interior is just the point.

Linestring
: A line through a sequence of at least two points.
  It can be closed, in which case the start point is also the end point.
  The line segments between the points are the interior, and the boundary consists of the end points.

Polygon
: Defined by an exterior closed linestring and zero or more interior closed linestrings.
  The exterior ring must be defined counter-clockwise and the interior rings clockwise, such that the right side of the linestring is always outside.
  See #c6fd for a nice figure.
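As a small illustration, all three types can be constructed from their WKT text form; this is a sketch against PostGIS, and the coordinates are made up:

```sql
select ST_GeometryType(ST_GeomFromText('POINT(1 2)'));
-- ST_Point

select ST_GeometryType(ST_GeomFromText('LINESTRING(0 0, 1 1, 2 0)'));
-- ST_LineString

-- Exterior ring counter-clockwise, interior ring (a hole) clockwise:
select ST_GeometryType(ST_GeomFromText(
    'POLYGON((0 0, 10 0, 10 10, 0 10, 0 0), (2 2, 2 4, 4 4, 4 2, 2 2))'));
-- ST_Polygon
```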
# Spatial Database Indexing

We can specify the indexing method in SQL like this:

```sql
-- Creation of the table
create table landscape (
    geo_name varchar(20) primary key,
    geo geometry not null
);

-- With a B+-tree
create index landscape_geo_name
on landscape (geo_name);

-- With an R-tree, via GiST (generalized search trees)
create index landscape_geo
on landscape using gist (geo);
```

## Space Filling Curves

*Also referred to as SFCs.*

It is often more convenient to store points in a specific 1D ordering, as we can then use existing indexing methods.
However, the spatial data we want to store and search is 2D.
Therefore we map this 2D data to 1D, making searching and inserting much easier.

We do this by splitting the 2D space into smaller chunks, and then finding an ordering of these chunks.
The ordering of the chunks is determined by a space filling curve, of which there are many different kinds.

A good space filling curve is one where points that are close in 2D space are also close to each other on the curve.

However, because many points land in the same chunk, we cannot know for sure whether two objects overlap just because they are in the same chunk.
The curve is therefore only used for finding candidates.

Using space filling curves allows us to reuse the B+-tree indexing structure, which is well supported by PostgreSQL.
Alternatively, they can be replaced by quadtrees or R-trees, which work in 2D by themselves.

## Trees

Here we introduce *minimum bounding rectangles*, or MBRs, which are axis-aligned rectangles.
These are used to give an estimate of an object, which can lead to false positives (but not false negatives).

We then insert these bounding boxes into tree nodes, with a maximum of M boxes per node.
When more than M boxes are to be inserted into a node, we split it in two.

Conversely, when deleting, we should merge nodes if they hold fewer than `m * M` boxes.
It is often hard to get the right clustering of boxes, as we want to minimize the dead space.

`m` is often `0.4`, and `M` is often set according to the page size etc.

When searching we use a two-step filtering process:
first the fast intersection test against the boxes, and then the expensive geometric intersection test.

We can refine R-trees into *R+-trees*, which require that there is no overlap between the node regions.
We will therefore have to insert some objects twice.
However, this eliminates the need for multi-path searching.

# Introduction to Parallel Databases

Let's start with some important definitions.

Distributed Database
: A software system that permits management of the distributed database and makes the distribution transparent to the user.
Scale-up
: Adding more CPU cores and other hardware to the same machine.
Scale-out
: Adding more, smaller machines and connecting them.
Throughput
: Time from query start until the last row is received.
Response Time
: Time from query start until the first row is received.

There is a nice slide about transparency in this lecture.
It does not really make sense to repeat it here.

## Performance

When executing queries we have many choices for how to parallelize.
**Inter-query parallelism** is when many smaller queries are performed in parallel,
while **intra-query parallelism** is when a single large query is executed concurrently internally.

## Design of a Distributed Database

This is often done in a *3-tier* architecture, with a client layer, an application layer, and a database server layer.
Each layer can then have many different interconnected computers, with communication to adjacent layers via a network.

## Partitioning

**Horizontal partitioning** (also called *sharding*) is when we have different sets of rows of the same table on different sites.
We therefore preserve the whole schema at each site.
We can partition the rows in different ways:

 - **Range partitioning**, by partitioning on some column such as a timestamp.
 - **List partitioning**, such as using *round-robin*.
 - **Hash partitioning**.

Here we should watch out for *data skew*, where the data is partitioned unevenly.
This could for example happen if we partition on a name column.
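As a sketch, the range and hash variants can be written directly with PostgreSQL's declarative partitioning; the table and column names here are made up:

```sql
-- Range partitioning on a date column:
create table sales (
    sale_id   int,
    sale_date date
) partition by range (sale_date);

create table sales_2021 partition of sales
    for values from ('2021-01-01') to ('2022-01-01');

-- Hash partitioning spreads rows evenly, which avoids data skew:
create table clicks (
    user_id int
) partition by hash (user_id);

create table clicks_0 partition of clicks
    for values with (modulus 4, remainder 0);
-- (the remaining three hash partitions are created the same way)
```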
**Vertical partitioning** is where each site has every row, but the partitioning happens on the columns.
This allows for better compression, but we must repeat the primary key columns across all sites for identification.

We must be able to say something about a partitioning.
Such as **completeness**, which states that all rows in the original table should still be present in at least one of the partitions.
Or **reconstruction**, which states that the original table can be recovered by applying some operator to all partitions.
And lastly **disjointness**, which states that the partitions should not share rows.

# Just Parallel Databases (what)

Scaling up, as explained before, is often much harder.
It is also called *vertical scaling* and is associated with *tight coupling*.

Scaling out is easier and often gives a more loosely coupled system; however, this comes at the cost of complexity.
Scaling out is also called *horizontal scaling*.

**Shared memory** is an architecture where we have a shared memory between processors.
Here each CPU can see the whole memory space, and they all run the same OS instance.
It is often used for *inter-query parallelism* or *load balancing*,
but it is not very common for databases.

**Shared disk** is very common for databases in cloud computing, such as Amazon S3.
Here many processors, each with their own memory, share a set of disks via an interconnect.
Disks and CPUs can therefore be scaled independently.

**Shared nothing :-(** is very common with NoSQL.
Here many computers are connected with an interconnect and share nothing, just like *scale-out*.
This is harder to scale than shared disk but can give better performance.

A **symmetric algorithm** is one where the processing of incoming tuples can be pipelined with a node's own tuples.
See page 368 for a better explanation.

## ACID and CAP

ACID stands for

 - Atomicity
 - Consistency
 - Isolation
 - Durability

While CAP stands for

 - Consistent (similar to the C in ACID)
 - Always Available
 - Partition tolerant

With CAP we are only allowed to have two of the three *things*.

## Amdahl's Law

The more of our computation we can do in parallel, the better.
If we could do all of it in parallel (which is impossible) we would get linear growth in performance as we add more units.
However, if we cannot, the speedup curve flattens out :-(.

There is a nice formula for this in the slides, but I can try to write it here,
with `s` being the sequential fraction of the work and `p` the number of processing units:

```haskell
maxSpeedUp s p = 1 / (s + (1 - s) / p)
```
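A runnable version of the formula; the 5% sequential fraction is only an illustration:

```haskell
-- With s = 0.05 the speedup can never exceed 1 / 0.05 = 20,
-- no matter how many processing units we add.
maxSpeedUp :: Double -> Double -> Double
maxSpeedUp s p = 1 / (s + (1 - s) / p)

main :: IO ()
main = mapM_ (print . maxSpeedUp 0.05) [1, 10, 100, 1000]
-- prints roughly 1.0, 6.90, 16.81 and 19.63, flattening out towards 20
```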
We are interested in parallelising sorting and joining.
These operate on rows that are spread all over the place, and it is beneficial to do as much as possible on the device where the data is stored, before sending it anywhere.

# Summary of Parallel Databases

This repeats some things in a nicer way.
I will just list what can be found in the slides, instead of repeating it.

 - Comparison between partitioning techniques, such as the advantages of hashing etc.
 - Interquery parallelism

   Multiple queries in parallel. Easiest to implement on shared memory.

 - Intraquery parallelism

   - **Intraoperation parallelism** parallelizes the execution of each operation in a query.
   - **Interoperation parallelism** executes different query operations in parallel.

# Intro Data Warehousing and Multidimensional Modeling

Reasons for data warehousing:

 - Often data quality is bad (different formats, inconsistent data).
 - Data is not permanent (deleted after 30 days, or overwritten by changes).

This makes it hard to answer the more business-oriented queries.

If we consider the example of a single sale, then this sale would be a *fact* in the database, and the sales value would be called a *measure*.
This sales fact is then associated with a set of *dimensions*, such as the product type, the store and the time of sale.

Each of these dimensions is organized in a hierarchy, such that we can "zoom" in or out for more or less detail.
For example with a timestamp, we may only be interested in the week or year, but we can also zoom in to specific days.

We therefore divide data into dimensions and facts.

Facts
: an important entity such as a sale. Often contains *measures*, which are things that can be aggregated, such as the sales price.
Dimensions
: describe a fact such that we can search for different things. For example where the sale happened, who bought it, etc.

We can then say that facts live in a *cube* with many dimensions (yeah, that does not make much sense, so *hypercube* is also sometimes used).
Often these cubes have around 4-12 dimensions, but often only 2 or 3 can be shown, due to our stupid brains.
We denote a cube with many empty cells a *sparse cube*, and the opposite a *dense cube*.

The slides have a nice list of things to do when designing a data warehouse.

## Types of Facts

Event fact
: or a *transaction fact*, is a business event such as a sale.
Measureless fact
: a fact without a numerical measure, so just something that happened for some combination of dimensions.
State fact
: captures the current status at some timestamp, such as inventory.

## Granularity

What does a single fact mean, i.e. the *level of detail*: "is a single fact the total sales in a day, or every single sale?".
We often let the granularity be a single business transaction, such as a sale; however, we can aggregate this for scalability.

## More on Measures

A measure is something that a user might want to study or optimize for.
Each measure has two components, namely a *numerical value* and an *aggregation formula*.

There are different kinds of measures:

 - An **additive measure** can be aggregated over all dimensions. A good example is the sales price.
 - A **semi-additive measure** can be aggregated over some dimensions, but not all.
   A good example is inventory, which cannot be aggregated over time (we would count things twice), but which we can aggregate over stores to get the total inventory.
 - A **non-additive measure** cannot be aggregated over anything, for example an average sales price.

## Implementation

Here the granularity is very important to consider: "what does a single row represent?".

It is not very nice to store all the dimension data with each fact, so instead we use *star schemas* and *snowflake schemas*,
which work with *fact tables* and *dimension tables*.

A **fact table** stores facts, and facts only, with references to the dimension values in the *dimension tables*.
Here we use a *surrogate key* to reference dimension values; such a key should be meaningless.

A **star schema** has one fact table, and then one de-normalized dimension table per dimension, containing all its levels.
This eliminates a lot of the foreign key usage, making things faster; however, it hides the levels from the user and takes up more storage.
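A minimal sketch of such a star schema; all table and column names are made up:

```sql
create table dim_product (
    product_key  serial primary key,  -- surrogate key, meaningless on purpose
    product_name varchar(50),
    category     varchar(50)          -- all levels de-normalized into one table
);

create table dim_date (
    date_key serial primary key,
    day      date,
    week     int,
    year     int
);

create table fact_sale (
    product_key int references dim_product,
    date_key    int references dim_date,
    sales_price numeric               -- the measure
);

-- Total sales per category and year, joining the facts with two dimensions:
select p.category, d.year, sum(f.sales_price)
from fact_sale f
join dim_product p on p.product_key = f.product_key
join dim_date    d on d.date_key    = f.date_key
group by p.category, d.year;
```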
A **snowflake schema** stores the dimensions normalized, in that each level has its own dimension table.

As for *redundancy*, it is something we want to avoid, as it opens the door to inconsistent data and complicates updating.

# Advanced Multidimensional Modeling

*warning: very very advanced*


diff --git a/sem7/pp/eksamnen.md b/sem7/pp/eksamnen.md
index 6641d2a..be694ae 100644
--- a/sem7/pp/eksamnen.md
+++ b/sem7/pp/eksamnen.md
@@ -7,6 +7,9 @@

## Everything Else

 - "The last condition ..." in logic part 2 at 1:40. Is that right?
 - Is there not an error at 0:32 in the prolog lists video?

# Mixed

*Parameters* are in the declaration, and *arguments* are the things actually passed.

@@ -21,6 +24,19 @@

The arguments are placed in the body according to the parameters.

**Eta** says that a function which just passes its arguments on to another function can be substituted by that function.

## Some Haskell Things

```haskell
-- if (x is made a parameter so the snippet compiles)
hej x = if x == 10 then 2 else 3

-- List comprehension
evens = [i | i <- [0..], i `mod` 2 == 0]

-- Here True and False are *term constructors* and Boolean is a *type constructor*
data Boolean = True | False
```

# Scheme lec. 1

Programming paradigm

@@ -184,3 +200,189 @@

Therefore normal order is the most powerful reduction.

`(delay expr)` is used to delay the evaluation of `expr`, by returning a promise.
The value of this can then be extracted with `(force promise)`.

# Haskell lec.5 Lazy Evaluation

Lazy evaluation is the default in Haskell.
Semantically it is quite simple, requiring only a small number of rules.

There is the beta rewrite rule from before.
And then there is the *application* rule, which states that for an application `e1 e2` we only evaluate `e1`.
This means that `e2` is never evaluated before the application is carried out.

# Haskell lec.7 Type Systems

## Simple Type System

Here expressions can have simple types such as Int, Bool, or function and tuple types.
This gives a couple of guarantees.

If `e` is *well typed* and `e` reduces to `e'`, then `e'` is also well typed.
This is called **subject reduction** and says something about *type preservation*.

The other is that if `e` is *well typed*, then `e` terminates; that is, there is an `e'` which `e` reduces to and which cannot be reduced any further.
This is **type safety**.

These guarantees also mean that there are many things one cannot express,
for example programs that do not terminate.

One also says that the simple type system has *slack*,
because there are expressions that are safe but not well typed.
For example:

```
let f = x: x; in (f 0, f True)
```

This just evaluates to `(0, True)`, but it is not well typed, because there is no simple type for `f`.
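In Haskell the example is accepted, precisely because of the parametric polymorphism introduced in the next section; a small sketch:

```haskell
-- The let-bound f gets the type scheme forall a. a -> a,
-- so it can be used at both Int and Bool:
pair :: (Int, Bool)
pair = let f = \x -> x in (f 0, f True)

main :: IO ()
main = print pair -- (0,True)
```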
## Parametric Polymorphism

Here we introduce

 - **Type variables**, which can represent anything.
 - **Type schemes**, which say something about free and bound type variables.

If a type scheme has no quantifier for a certain type variable, that variable is *free*.

We say that `Ftv(t)` finds all free type variables in the type `t`.
Similarly, `Ftv(E)` for a type environment `E` is the result of `Ftv(t)` for all types `t` in `E`.

### Specialization

Here one says that a type is a *specialization* of a type scheme,
so we make a type from a type scheme.

For example, `Int -> Int \leq \forall a. a -> a`.

### Generalization

This is the reverse: we make a type scheme from a type.
If a type has a free variable `a`, we can make a scheme containing `\forall a`.

We do this with the function `close(E, t)`.
This function adds quantifiers for all variables in `Ftv(t) \setminus Ftv(E)`.

### Principal Types

A type `t` is principal for an expression `e` and type environment `E` if

 - `t` is a type for `e` in `E`,
 - and every type `t1` for `e` in `E` is a specialization of `t`.

So one can call the principal type the most general type.

Here we would like a *type inference algorithm* that finds the principal type of an expression.

### Semantic Rules

Here some rules are introduced which, together with the old rules, can identify types for expressions.
These are `PROJ`, which performs specialization, and `GENERALIZATION`, which performs generalization.

# Haskell lec.8 More Typing

## Type Classes

With type classes, the quantifiers in a scheme no longer say "`a` can be anything" but "`a` is part of type class `\Gamma`".
We use `\Gamma` for a collection of type classes.

This means that where we previously determined types from an environment mapping variables to types,
there is now also a `C` that maps type variables to type classes.

Okay, I think I will take these notes on paper.
A bit too much mathematics.

# Logic lec.9 Introduction and Datalog

Here I again take notes on paper.

# Logic lec.10 Datalog

The **Herbrand universe** is the set of constants that appear in the Datalog program.
These are the things which we can talk about.

The **Herbrand base** is then the set of all ground atoms that can be created.
This is basically all the things we can say about the **Herbrand universe**, giving a pretty large set.

## Interpretation

An *interpretation* is a set of atoms from the Herbrand base.
This could be any subset, and it does not have to make sense.

We will now talk about how a clause can be true under an interpretation, or how an interpretation models a clause.
Facts are easy: we just check whether the fact is contained in the interpretation.
A rule is only false if all the atoms in the body are true and the head is false, due to the implicative nature of rules.

## Model

We say that an interpretation is a model for a clause if every ground instance of the clause is true under the interpretation.
Such an interpretation we denote with M.

We denote by the *minimal model* of a program the smallest possible model for the program.
This minimal model can be found as the intersection of all models of the program.

We can also find it by starting with the empty interpretation, and then repeatedly extending it with all possible conclusions from the program.
First facts are added, and then we start adding conclusions from rules.
This process will always terminate.

## Goal Directed Proof Search

Instead of computing the minimal model, we are often only interested in a specific conclusion or query.
This is handled with *goal directed proof search*,
which works by constructing a proof tree.
Here we branch the tree every time we have to choose between two clauses.
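A tiny made-up program that shows the branching:

```prolog
edge(a, b).
edge(b, c).
path(X, Y) :- edge(X, Y).             % clause 1
path(X, Y) :- edge(X, Z), path(Z, Y). % clause 2

% The goal path(a, W) branches on the two path clauses:
%   clause 1: edge(a, W)             succeeds with W = b
%   clause 2: edge(a, Z), path(Z, W) binds Z = b,
%             then path(b, W)        succeeds with W = c
```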
## Negation in Datalog

In Datalog we can only negate atoms, not rules.
We therefore distinguish between negative and positive atoms.
A negative atom may then only appear in the body of a rule, never in the head.

Here we extend the idea of when an atom holds in an interpretation, such that `I |= not p(t)` holds if `p(t)` is not in `I`.

In Datalog we need to restrict things a bit for this to make sense, as we can otherwise create programs with multiple incompatible models:

```prolog
P(x) :- not Q(x).
Q(x) :- not P(x).
```

Therefore we want to disallow circularity, which we do by defining layers.
We split the program into subprograms, where each subprogram only refers to things from the previous layers.
We call this *stratification*.

## Some Words

 - *Model-theoretical* semantics refers to finding a solution with the Herbrand universe and base, interpretations and minimal models.
 - *Fixed-point* semantics is the bottom-up approach, where we compute the minimal model.
 - *Proof-theoretical* semantics is the top-down approach, with proof trees.

These can be shown to be equivalent.

The idea of a minimal model does not make much sense in Prolog, as the minimal models can be infinite.

# Logic lec.11 Prolog Instead of Datalog

One should remember that Prolog uses proof search instead of computing a minimal model.
With proof search and the existence of recursive clauses, the ordering of clauses matters,
because the order of clauses determines the order in which the proof search tries things.

## Some Short Prolog Things

```prolog
% This mimics the idea of Haskell's composite types.
nat(zero).
nat(succ(X)) :- nat(X).

% Some list notation:
% [1,2,3] is the same as [1 | [2 | [3 | []]]]
```

The above means that the Herbrand base is infinite.

## Proof Search

We check each clause in the order in which they appear.
If the head of a clause unifies with the current goal, we replace the goal with the body of that clause.
We then get new goals, which we continue with.
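As a sketch, proof search on the `nat` program from above proceeds like this:

```prolog
% Goal: nat(succ(succ(zero))).
%   the head nat(succ(X)) unifies with X = succ(zero),
%     giving the new goal nat(succ(zero));
%   nat(succ(X)) unifies again with X = zero,
%     giving the new goal nat(zero);
%   nat(zero) matches the fact, so the proof succeeds.
```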
diff --git a/sem7/pp/larger.hs b/sem7/pp/larger.hs
index 5618faf..2cf8e57 100644
--- a/sem7/pp/larger.hs
+++ b/sem7/pp/larger.hs
@@ -1,5 +1,26 @@
 import Data.Maybe
+import Data.Set (empty, insert, member) -- needed for valid below

+-- From lec5
+valid :: (Integral a, Ord a) => [(a, b)] -> Bool
+valid = recur empty
+    where recur _ [] = True
+          recur known (pair:rest) =
+              if key `member` known
+              then False
+              else recur (insert key known) rest
+              where key = fst pair
+
+-- TODO, the case where the key is not in the list is unhandled
+lookup' :: (Integral a, Ord a) => [(a, b)] -> a -> b
+lookup' (pair:rest) key = if fst pair == key
+                          then snd pair
+                          else lookup' rest key
+
+-- I don't see how findfun and lookup' are not the same
+findfun = lookup'
+
 data LTree = LLeaf String | LNode LTree LTree
 data LTreeRoot = LTreeRoot LTree Float

diff --git a/sem7/pp/lec5.hs b/sem7/pp/lec5.hs
index 3340346..37f6b12 100644
--- a/sem7/pp/lec5.hs
+++ b/sem7/pp/lec5.hs
@@ -39,7 +39,7 @@ ispalindrome x = x == reverse' x

 -- Okay assuming that a are all integers, I
--- would say that the type of cfrac :: (Real a, Integral b) -> a -> b -> [b]
+-- would say that the type of cfrac :: (Real a, Integral b) => a -> b -> [b]
 cfrac _ 0 = []
 cfrac x n = let intPart = truncate x in
@@ -69,23 +69,3 @@ flatten' (x:xs) = x ++ flatten' xs

 -- The function type is therefore flatten' :: [[a]] -> [a]
 -- When running :t on flatten' we get the same.
-
-
-valid :: (Integral a, Ord a) => [(a, b)] -> Bool
-valid = recur empty
-    where recur _ [] = True
-          recur known (pair:rest) =
-              if key `member` known
-              then False
-              else recur (insert key known) rest
-              where key = fst pair
-
-
--- TODO, the case where the key is not in the list is unhandled
-lookup' :: (Integral a, Ord a) => [(a, b)] -> a -> b
-lookup' (pair:rest) key = if fst pair == key
-                          then snd pair
-                          else lookup' rest key
-
--- I don't see how findfun and lookup' are not the same
-findfun = lookup'
--
cgit v1.2.3