Design A Data Pipeline
~2 min read
Goals
- Ingest data from different sources
- Run a graph of SQL transforms to prepare it for modeling
- Persist clean results in a target store
- Make long-running operations async
- Make it easy to follow the progress, logs, and errors
- Support multiple flows running at the same time
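The "graph of SQL transforms" goal can be sketched as a small dependency graph executed in topological order inside an async task. The `transforms` dict, its SQL strings, and the `execute_sql` callback below are illustrative assumptions, not part of any library:

```python
import asyncio

# Hypothetical transform graph: name -> (SQL text, names of upstream transforms).
transforms = {
    "clean_orders": ("SELECT * FROM raw_orders WHERE amount > 0", []),
    "clean_users": ("SELECT * FROM raw_users WHERE active", []),
    "features": ("SELECT * FROM clean_orders JOIN clean_users USING (user_id)",
                 ["clean_orders", "clean_users"]),
}

def topo_order(graph):
    """Return transform names so every dependency runs before its dependents."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in graph[name][1]:
            visit(dep)
        order.append(name)

    for name in graph:
        visit(name)
    return order

async def run_transforms(graph, execute_sql):
    # Run transforms in dependency order; execute_sql is whatever async
    # function submits a SQL statement against the target store.
    for name in topo_order(graph):
        await execute_sql(name, graph[name][0])
```

Running the whole graph as one coroutine keeps the long-running work off the request path, which lines up with the async goal above.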
```python
from datetime import datetime
from enum import Enum


class SourceTypes(Enum):
    DB = 1
    File = 2


class ConnectionInfo:
    ...


class DataStore:
    source_type: SourceTypes
    connection_info: ConnectionInfo


class DBConnection(ConnectionInfo):
    host: str
    port: int
    user: str
    password: str
    db_name: str


class FileConnection(ConnectionInfo):
    path: str


class DBStore(DataStore):
    source_type = SourceTypes.DB
    connection_info: DBConnection


class FileStore(DataStore):
    source_type = SourceTypes.File
    connection_info: FileConnection


class FlowStates(Enum):
    READY = 1
    RUNNING = 2
    CANCELLED = 3
    FAILED = 4
    COMPLETED = 5


class Progress:
    percent: int
    started: datetime
    updated: datetime


class Flow:
    sources: list[DataStore]
    target: DataStore
    state: FlowStates
    progress: Progress
    logs: list[str]
    errors: list[str]
```
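One way to keep a Flow's state consistent as it moves between READY, RUNNING, and the terminal states is to centralize the legal transitions. This is a sketch, not part of the model above; the `ALLOWED` table and `transition` helper are assumptions, and `FlowStates` is repeated here only so the snippet runs on its own:

```python
from enum import Enum


class FlowStates(Enum):
    READY = 1
    RUNNING = 2
    CANCELLED = 3
    FAILED = 4
    COMPLETED = 5


# Transitions a flow may legally take; anything else is a programming error.
ALLOWED = {
    FlowStates.READY: {FlowStates.RUNNING, FlowStates.CANCELLED},
    FlowStates.RUNNING: {FlowStates.COMPLETED, FlowStates.FAILED,
                         FlowStates.CANCELLED},
    # CANCELLED, FAILED, and COMPLETED are terminal: no outgoing transitions.
}


def transition(current: FlowStates, target: FlowStates) -> FlowStates:
    """Validate a state change, raising on any move the table forbids."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Guarding every state change this way means a runner can never, for example, resurrect a COMPLETED flow into RUNNING, which matters once multiple flows execute concurrently.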