From 84ae9c0267a32488d5163e4fec325fb723703090 Mon Sep 17 00:00:00 2001
From: Julian T
Date: Tue, 11 Jan 2022 12:01:29 +0100
Subject: Add exam notes for db and pp
---
 sem7/db/eksamnen.md | 334 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 334 insertions(+)
 create mode 100644 sem7/db/eksamnen.md
(limited to 'sem7/db/eksamnen.md')

diff --git a/sem7/db/eksamnen.md b/sem7/db/eksamnen.md
new file mode 100644
index 0000000..2a6480b
--- /dev/null
+++ b/sem7/db/eksamnen.md
@@ -0,0 +1,334 @@

# TODO

 - Do exercise 2.3 in the distributed databases material

# Nice Spatial SQL Commands

 - `ST_GeometryType(thing)` finds the type of `thing`.
 - `ST_Boundary(thing)` finds the boundary of `thing`. There is probably an interior version too.
 - `ST_DWithin(a, b, distance)` returns true if `a` and `b` are within `distance` of each other.
 - `ST_MakeBox2D(a, b)` creates a box from two points.

One can find the nearest neighbours with:

```sql
select l.*,
       ST_Distance(l.geo, ST_Point(5, 5)) as dist
from landscape as l
order by dist
limit 5;
```

# Introduction to Spatial Queries

**Range queries** are the simplest, as we just query inside a rectangle.
This is very similar to something like

```sql
SELECT * FROM employee WHERE salary BETWEEN 10 AND 20
```

only with spatial range queries we often look in two dimensions instead of one.

**Directional relationships** describe notions like north, west, or south, and can be computed with `ST_Azimuth`.
However, they do not make much sense in 3D.

## Topological Relationships

We first define boundary and interior.
The **boundary** is the set of pixels adjacent to a pixel outside the object,
while the **interior** is the space inside the object not occupied by the boundary.

If we consider two shapes, we can say that they are either *intersecting* or *disjoint*.
When they are disjoint we cannot really say more than that, but if they *intersect* the relationship can be one of:

 - *equal*, if their boundaries and interiors are equal,
 - *inside*, if one interior is completely inside the other,
 - *contains*, which is the reverse of *inside*,
 - *adjacent*, if their boundaries share pixels but their interiors do not,
 - *overlaps*, if they share interior pixels but each has pixels which are not in the other.

```sql
ST_Contains(b.geo, y.geo)
ST_Within(b.geo, y.geo)
ST_Touches(b.geo, y.geo)
ST_Equals(b.geo, y.geo)
ST_Overlaps(b.geo, y.geo)
```

## Points, Linestrings and Polygons

These are used to represent the different things in the 2D database.
This is nicely explained [here](https://help.arcgis.com/en/geodatabase/10.0/sdk/arcsde/concepts/geometry/shapes/types.htm).

Point
: A single point represented by coordinates. Its boundary is empty and its interior is just the point.

Linestring
: A line through a sequence of at least two points.
  Can be closed, in which case the begin point is also the end point.
  The line between the points is the interior, and the boundary is the two end points.

Polygon
: Defined by one exterior closed linestring and zero or more interior closed linestrings.
  The exterior ring must be defined counter-clockwise and the interior rings clockwise, such that the right side of the linestring is always outside.
  See #c6fd for a nice figure.

# Spatial Database Indexing

We can specify the indexing method in SQL like this.
```sql
-- Creation of the table
create table landscape (
    geo_name varchar(20) primary key,
    geo geometry not null
);

-- With a B+ tree
create index landscape_geo_name
on landscape (geo_name);

-- With an R-tree through GiST, generalized search trees
create index landscape_geo
on landscape using gist (geo);
```

## Space Filling Curves

*Also referred to as SFCs.*

It is often more convenient to store points in a specific 1D ordering, as we can then use existing indexing methods.
However, the spatial data we want to store and search is 2D.
Therefore we map this 2D data to 1D, making searching and inserting much easier.

We do this by splitting the 2D space into smaller chunks and then finding an ordering of these chunks.
The ordering of the chunks is determined by a space filling curve, of which there are many variants.

A good space filling curve is one where points that are close in 2D space are also close to each other on the curve.

However, because many points fall into the same chunk, we cannot know for sure whether two objects overlap just because they share a chunk.
The curve is therefore only used for finding candidates.

Using space filling curves allows us to reuse the B+ tree indexing structure, which is well supported by PostgreSQL.
However, they can also be combined with quadtrees or R-trees, which work in 2D by themselves.

## Trees

Here we introduce *minimum bounding rectangles*, or MBRs, which are axis-aligned rectangles.
These are used to give an estimate of an object, which can lead to false positives (but not false negatives).

We then insert those bounding boxes into tree nodes, with a maximum of M boxes per node.
When more than M boxes would end up in a node, we split it in two.

Conversely, when deleting we should merge nodes that fall below `m * M` entries.
Here it is often hard to find the right clustering of boxes, as we want to minimize the dead space.
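The MBR test described above can be sketched in a few lines. This is a hypothetical minimal version (class and coordinate names are invented, not from the course material), showing why the box test can give false positives but never false negatives:

```python
# Minimal sketch of the MBR (minimum bounding rectangle) filter step.
# In a real R-tree these boxes live in the tree nodes.

class MBR:
    def __init__(self, xmin, ymin, xmax, ymax):
        self.xmin, self.ymin = xmin, ymin
        self.xmax, self.ymax = xmax, ymax

    def intersects(self, other):
        # Cheap axis-aligned overlap test; it may say True for objects
        # whose exact geometries do not overlap (a false positive),
        # but never False for objects that do (no false negatives).
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)

a = MBR(0, 0, 2, 2)
b = MBR(1, 1, 3, 3)
c = MBR(5, 5, 6, 6)
print(a.intersects(b))  # True  -> candidate for the exact geometric test
print(a.intersects(c))  # False -> discarded without the expensive test
```

Only the candidates that pass this cheap test are handed on to the expensive exact geometric intersection.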
`m` is often `0.4`, and `M` is often set according to page size etc.

When searching we use a two-step filtering process:
first a fast intersection test on the boxes, and then the expensive geometric intersection test on the candidates.

We can refine R-trees into *R+trees*, which require that there is no overlap between the node regions.
We then have to insert some objects twice, but this eliminates the need for multi-path searching.

# Introduction to Parallel Databases

Let's start with some important definitions.

Distributed Database
: A software system that permits management of the distributed database and makes the distribution transparent to the user.

Scale-up
: Adding more CPU cores and other hardware to one machine.

Scale-out
: Adding more, smaller machines and connecting them.

Throughput
: Time from query start until the last row is received (i.e. total completion time).

Response Time
: Time from query start until the first row is received.

There is a nice slide about transparency in this lecture.
It does not really make sense to repeat it here.

## Performance

When executing queries we have many choices for how to parallelize.
**Inter-query parallelism** is when many smaller queries are executed in parallel,
while **intra-query parallelism** is when a single large query is executed in parallel across multiple processors.

## Design of Distributed Databases

Often done in a *3-tier* architecture, with a client layer, an application layer, and a database server layer.
Each layer can have many interconnected computers, communicating with adjacent layers via a network.

## Partitioning

**Horizontal partitioning** (also called *sharding*) is when different sets of rows of the same table are stored at different sites.
We therefore preserve the whole schema at each site.
We can partition the rows in different ways:

 - **Range partitioning**, by partitioning on some column such as a timestamp.
 - **List partitioning**, such as using *round-robin*.
 - **Hash partitioning**, where a hash of the partitioning key chooses the site.
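As a sketch of the last option, hash partitioning can be illustrated like this (the site count, row format, and the use of `crc32` as the hash function are my own assumptions for the example, not from the course material):

```python
import zlib

def hash_partition(rows, key, n_sites):
    """Assign each row to the site chosen by hashing its partitioning key."""
    sites = [[] for _ in range(n_sites)]
    for row in rows:
        # crc32 is used here only because it is deterministic across runs.
        site = zlib.crc32(str(row[key]).encode()) % n_sites
        sites[site].append(row)
    return sites

rows = [{"id": i, "name": f"user{i}"} for i in range(10)]
sites = hash_partition(rows, "id", 3)

# Every row lands on exactly one site (completeness and disjointness).
assert sum(len(s) for s in sites) == len(rows)
```

Range and list partitioning differ only in how the target site is chosen from the key.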
Here we should watch out for *data skew*, where the data is partitioned unevenly.
This could for example happen if we partition on a name column.

**Vertical partitioning** is where each site has every row, but the partitioning happens on the columns.
This allows for better compression, but we must repeat the primary key columns across all sites for identification.

We must be able to say something about a partitioning, such as:

 - **Completeness**, which states that every row of the original table must be present in at least one partition.
 - **Reconstruction**, which states that the original table can be recovered by applying some operator to all partitions.
 - **Disjointness**, which states that the partitions should not share rows.

# Just Parallel Databases (what)

Scaling up, as explained before, is often much harder.
It is also called *vertical scaling* and is associated with *tight coupling*.

Scaling out is easier and gives a more loosely coupled system; however, this comes at the cost of complexity.
Scaling out is also called *horizontal scaling*.

**Shared memory** is an architecture with a common memory between processors.
Each CPU can see the whole memory space, and they all run the same OS instance.
Often used for *inter-query parallelism* or *load balancing*, but not very common for databases.

**Shared disk** is very common for databases in cloud computing, such as Amazon S3.
Here many processors, each with their own memory, share a set of disks via an interconnect.
Therefore disks and CPUs can be scaled independently.

**Shared nothing :-(** is very common with NoSQL.
Here many computers are connected with an interconnect and share nothing, as in *scale-out*.
This is harder to scale than shared disk but can give better performance.

A **symmetric algorithm** is one where the processing of incoming tuples can be pipelined with the node's own tuples.
See page 368 for a better explanation.
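A classic example of such a symmetric algorithm is the pipelined symmetric hash join. Here is a rough single-process sketch (the tuple layout, key positions and interleaving are invented for illustration; a real system processes tuples as they arrive over the interconnect):

```python
import itertools
from collections import defaultdict

def symmetric_hash_join(left, right, key_l, key_r):
    """Join two tuple streams, emitting results before either stream ends."""
    h_left, h_right = defaultdict(list), defaultdict(list)
    out = []
    # Consume one tuple from each input per step.
    for l, r in itertools.zip_longest(left, right):
        if l is not None:
            out.extend((l, m) for m in h_right[l[key_l]])  # probe other side
            h_left[l[key_l]].append(l)                     # then insert own
        if r is not None:
            out.extend((m, r) for m in h_left[r[key_r]])
            h_right[r[key_r]].append(r)
    return out

left = [("a", 1), ("b", 2)]
right = [(1, "x"), (2, "y"), (1, "z")]
print(symmetric_hash_join(left, right, 1, 0))
# [(('a', 1), (1, 'x')), (('b', 2), (2, 'y')), (('a', 1), (1, 'z'))]
```

Because each tuple probes the other side's hash table before being inserted into its own, every matching pair is produced exactly once, with no blocking phase.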
## ACID and CAP

ACID stands for

 - Atomicity
 - Consistency
 - Isolation
 - Durability

while CAP stands for

 - Consistency (not quite the same as in ACID: every read sees the most recent write)
 - Availability
 - Partition tolerance

With CAP we are only allowed to have two of the three *things*.

## Amdahl's Law

The more of our computation we can do in parallel, the better.
If we could parallelize all of it (which is impossible) we would get linear growth in performance as we add more units.
If we cannot, the speedup curve flattens out :-(.

There is a nice formula for this in the slides, but I can try to write it here:

```haskell
maxSpeedUp seq p = 1 / (seq + (1 - seq) / p)
```

where `seq` is the sequential fraction of the work and `p` is the number of parallel units.

We are interested in parallelizing sorting and joining.
These operate on rows that are spread all over the place, so it is beneficial to do as much as possible on the device where the data is stored before sending it.

# Summary of Parallel Databases

This repeats some things in a nicer way.
I will just note what can be found in the slides, instead of repeating it.

 - Comparison between partitioning techniques, such as the advantages of hashing etc.
 - Interquery parallelism

   Multiple queries run in parallel. Easiest to implement on shared memory.

 - Intraquery parallelism

   - **Intraoperation parallelism** parallelizes the execution of a single operation in the query.
   - **Interoperation parallelism** executes different operations of the query in parallel.

# Intro to Data Warehousing and Multidimensional Modeling

Reasons for data warehousing:

 - Source data quality is often bad (different formats, inconsistent data).
 - Source data is not permanent (deleted after 30 days, or overwritten by changes).

This makes it hard to answer more business-oriented queries.

If we consider the example of a single sale, this sale would be a *fact* in the database, and the sales value would be called a *measure*.
This sales fact is then associated with a set of *dimensions*, such as the product type, the store and the sale time.
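To make facts, measures and dimensions concrete, here is a tiny hypothetical sketch (all names and numbers are invented) where sales facts are aggregated over a chosen set of dimensions:

```python
from collections import defaultdict

# Miniature fact table: each fact has dimension values
# (product, store, day) and one additive measure (price).
facts = [
    {"product": "tea",    "store": "north", "day": "mon", "price": 30},
    {"product": "coffee", "store": "north", "day": "mon", "price": 25},
    {"product": "tea",    "store": "south", "day": "tue", "price": 30},
]

def roll_up(facts, dims):
    # Sum the additive measure over the chosen dimensions,
    # discarding the detail of all other dimensions.
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in dims)] += f["price"]
    return dict(totals)

print(roll_up(facts, ["product"]))  # {('tea',): 60, ('coffee',): 25}
```

Choosing fewer dimensions corresponds to "zooming out"; choosing more corresponds to "zooming in" on the cube.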
Each of these dimensions is organized in a hierarchy, such that we can "zoom" in or out for more or less detail.
For example with a timestamp, we may only be interested in the week or the year, but we can also zoom in to specific days.

We therefore divide data into dimensions and facts.

Fact
: An important entity such as a sale. Often contains *measures*, which are things that can be aggregated, such as the sales price.

Dimension
: Describes a fact such that we can search for different things, for example where the sale happened, who bought it, etc.

We can then say that facts live in a *cube* with many dimensions (for more than three dimensions, *hypercube* is the more accurate term).
Often these cubes have around 4-12 dimensions, but only 2 or 3 of them can be shown at once due to our stupid brains.
We denote a cube with many empty cells a *sparse cube*, and the opposite a *dense cube*.

The slides have a nice todo list of things to do when designing a data warehouse.

## Types of Facts

Event fact
: Or *transaction fact*; a business event such as a sale.

Measureless fact
: A fact without a numerical measure, i.e. just something that happened for some combination of dimension values.

State fact
: Captures the current status at some timestamp, such as the inventory.

## Granularity

What does a single fact mean, i.e. the *level of detail*: "is a single fact the total sales in a day, or every single sale?".
We often let the granularity be a single business transaction, such as a sale; however, we can aggregate this for scalability.

## More on Measures

A measure is something that a user might want to study or optimize for.
Each measure has two components, namely a *numerical value* and an *aggregation formula*.

There are different forms of measures:

 - **Additive measures** can be aggregated over all dimensions. A good example is the sales price.
 - **Semi-additive measures** can be aggregated over some dimensions, but not all.
   A good example is the inventory, which cannot be aggregated over time (we would count items twice), but which can be aggregated over stores to get the total inventory.
 - **Non-additive measures** cannot be aggregated over anything, for example an average sales price.

## Implementation

Here the granularity is very important to consider: "what does a single row represent?".

It is not very nice to store all the dimension data with each fact, so instead we use *star schemas* and *snowflake schemas*, which work with *fact tables* and *dimension tables*.

A **fact table** stores facts, and facts only, with references to the dimension values in the *dimension tables*.
Here we use a *surrogate key* to reference dimension values; surrogate keys should be meaningless.

A **star schema** has one fact table, and then one de-normalized dimension table per dimension, containing all of its levels.
This eliminates a lot of the foreign-key usage, making queries faster; however, it hides the levels from the user and takes up more storage.

A **snowflake schema** stores the dimensions normalized, in that each level has its own dimension table.

As for *redundancy*, this is something we want to avoid, as it allows for inconsistent data and complicates updating.

# Advanced Multidimensional Modeling

*warning: very very advanced*