motivation
狗群友都拿到amazon offer了, 我好酸啊! 而我的简历上还不知道写什么项目, 决定写一遍15445!!!
database basics
definition
organized collection of interrelated data that models some aspect of the real world.
example
假设我们需要创建一个数据库来models 一个数字音乐商店. 那么我们需要存储:
- 关于artists 的 info:
year, name, country
- artists的专辑albums:
name, year
- albums中的 不同 歌曲tracks:
name, number
- entity-relationship diagram:
可以清晰的知道 一个专辑可能有多个artists, 一张专辑可能有多首歌曲. 而一首歌曲一般只出现在一张专辑中.
使用csv来存储
第一个简单的想法就是使用csv来存储, 不同的属性值用逗号(,
)来分割, 每一个entity一个单独的文件;
使用这种方式的话我们的程序每次要读写这些记录的时候必须要parse这些csv文件;
比如如果在上面的存储方式中要查询ice cube
这个artist
的year
需要如下的代码:
1 | for line in file: |
flat files 存储的问题:
- 如果别人用一个invalid string覆盖了
album
的year属性怎么办? 那么你的string转int当作year来使用就会有问题;(aka, 没有约束) - 如果一个专辑有多个作家, 该怎么存储他们的名字呢? 先来一个数字吗? 复杂化parse程序?
- 如何查找一个特定的
record
? 通过string match吗? 效率太低! - 如果两个线程同时写一个csv file怎么办呢? 没并发控制!
- 当进行存储update的时候宕机了怎么办? 没log! 没法保证原子性
- 如果我们想要replicate这个database怎么办呢? 可扩展性不好! 不支持分布式
DBMS
database management system
is software that allows applications to store and analyze infomation in a database;
通常的dbms设计允许 定义
,创建
查询
更新
以及 对 databases的管理
;
On-line Transaction Processing (OLTP)
Fast operations that only read/update a small amount of data each time
On-line Analytical Processing(OLAP)
Complex queris that read a lot of data to compute aggreates
Hybrid Transaction + Analytical Processing
OLTP + OLAP together on the same database instance
DBMS types
relational model 关系模型
关系模型的三要素
- structure 结构:关系的定义及其结构内容。
- integrity 完整性:确保数据库的内容满足限制。
- manipulation 操纵:如何访问和修改数据库的内容。
relation
Relational databases are the most common data model because it is the most flexible.
tuple
A tuple is a sequence of attribute values in the relation. 也就是一条record
自我认知
关系就是表,tuple就是行,属性就是列
主键 primary key
A relation’s primary key uniquely identifies a single tuple
在artist表中并不能认为 name
就是主键, 万一有重名的歌手呢? 应该使用id
作为主键
MySQL
支持自增integer主键
外键 foreign key
A foreign key specifies that an attribute from one relation has to map to a tuple in another relation.
A的某个键是B的主键, 那么这个键就是A的外键!!!
例如在上图中: album_id
是album
的主键, 那么 album_id
是一个外键, 同理artist_id
也是外键(对, 主键也可以是外键);
conclusion
Databases are ubiquitous.
Relational databases are the most common data model because it is the most flexible.
关系代数
definition
Fundamental operations to retrive and manipulate tuples in a relation. (Based on set algebra)
Operators:
Select
:- choose a subset of tuples from arelation that satisfies a selection predicate.
- eg:
select * from R where id=01;
Projection
:- generate a relation with tuples that contains only the specified attributes.
- eg:
select name, year from R where id=01;
, compare difference with example above;
- Union:
- generate a relation that contains all tuples that appear in either only one or both of the input relations.
- eg:
select * from R union select * from S;
- In order to union two tables there are a couple of requirements:
- The number of columns must be the same for both select statements.
- The columns, in order, must be of the same data type.
- 在两个表中都出现的会重复显示, it does not remove duplicates
- Intersection:
- generate a relation that contains only the tuples that apper in both of the input relations;
- eg:
select * from R intersect select * from S;
- Difference:
- generate a relation that contains only the tuples that appear in the first and not the second of the input relations;
- eg:
select * from R except select * from S;
- Product:
- generate a relation that contains all possible combinations of tuples from the input relations;
- syntax:
select * from R cross join S
; orselect * from R, S;
- Join:
- generate a relation that contains all tuples that are a combination of two tuples(one from each input relation) with a common value(s) for one or more attributes.
- syntax:
select * from R natural join S;
Each operator takes one or more relations as its inputs and outpus a new relation. We can chain operators together to create more compelx operations.
cross join is the same thing as Cartesian product.
difference between inner join and intersect(quora):
Intersect is an operator and Inner join is a type of join. Intersect can return matching null values but inner join can’t. Intersect doesn’t return any duplicate values but inner join returns duplicate values if it’s present in the tables.