Commit ca047f6

abdulelahsm authored and devin-petersohn committed
DOCS-#2433: Updated README.md with modin_vs_dask.md doc
Signed-off-by: Abdulelah S. Al Mesfer <[email protected]>
1 parent 40ae5a8 commit ca047f6

2 files changed: +33 -1 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -180,7 +180,7 @@ and improve:
 ![Architecture](docs/img/modin_architecture.png)
 
 Visit the [Documentation](https://modin.readthedocs.io/en/latest/developer/architecture.html) for
-more information!
+more information, and check out [the difference between Modin and Dask!](https://github.com/modin-project/modin/tree/master/docs/modin_vs_dask.md)
 
 **`modin.pandas` is currently under active development. Requests and contributions are welcome!**

docs/modin_vs_dask.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# What is the difference between Dask DataFrame and Modin?

**The TL;DR is that Modin's API is identical to the pandas API, whereas Dask's is not. Note: The projects are fundamentally different in their aims, so a fair comparison is challenging.**

## API

### Dask DataFrame

Dask DataFrame does not scale the entire pandas API, and it isn't trying to. The Dask documentation explains this [here](http://docs.dask.org/en/latest/dataframe.html#common-uses-and-anti-uses).

Dask DataFrame's API also differs from the pandas API in that it is lazy and needs `.compute()` to materialize the DataFrame. This makes the API less convenient but allows certain query optimizations and rearrangements, which can give speedups in some situations. We are planning to incorporate similar capabilities into Modin but hope we can do so without having to change the API. We will outline plans for speeding up Modin in an upcoming blog post.
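
To make the lazy-versus-eager distinction concrete, here is a minimal, standard-library-only sketch. `LazyColumn`, its `map` method, and the `double` helper are hypothetical names invented for illustration; this is not Dask's implementation, only a toy showing that work is recorded first and executed when `compute()` is called.

```python
from typing import Callable, List, Optional


class LazyColumn:
    """Toy stand-in for a lazy collection (hypothetical, not Dask's API)."""

    def __init__(self, data: List[int],
                 ops: Optional[List[Callable[[int], int]]] = None):
        self._data = data
        self._ops = list(ops or [])

    def map(self, fn: Callable[[int], int]) -> "LazyColumn":
        # Lazy step: only record the operation and return a new deferred object.
        return LazyColumn(self._data, self._ops + [fn])

    def compute(self) -> List[int]:
        # Every recorded operation runs here, fused into one pass over the data.
        out = []
        for x in self._data:
            for fn in self._ops:
                x = fn(x)
            out.append(x)
        return out


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # track when the function is actually executed
    return x * 2


lazy = LazyColumn([1, 2, 3]).map(double).map(lambda x: x + 1)
calls_before_compute = len(calls)  # nothing has run yet
result = lazy.compute()            # the work happens here
```

Because the full chain of operations is known before anything runs, a lazy system is free to fuse or reorder steps. The cost is the extra `compute()` call that an eager, pandas-style API does not need.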

### Modin

Modin attempts to parallelize as much of the pandas API as possible. We have worked through a significant portion of the DataFrame API. Modin is intended to be used as a drop-in replacement for pandas, so an operation that is not yet parallelized still works by defaulting to pandas.
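
The "default to the reference implementation" idea can be sketched with a small wrapper. Everything here (`FallbackFrame`, `total`, the `fallbacks` list) is a hypothetical illustration built on a plain Python list, not Modin's actual mechanism: methods the wrapper implements itself are used directly, and anything else is delegated to the wrapped object.

```python
class FallbackFrame:
    """Hypothetical wrapper: use an accelerated method if one exists,
    otherwise delegate to the wrapped object."""

    def __init__(self, inner):
        self._inner = inner
        self.fallbacks = []  # names of attributes that were delegated

    def total(self):
        # Stand-in for an operation that has an "accelerated" implementation.
        return sum(self._inner)

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails,
        # i.e. when the wrapper has no implementation of its own.
        attr = getattr(self._inner, name)
        self.fallbacks.append(name)
        return attr


frame = FallbackFrame([3, 1, 2])
accelerated = frame.total()   # handled by the wrapper itself
delegated = frame.count(1)    # falls back to list.count
```

The caller uses one object throughout and never needs to know which path served a given call, which is what makes a drop-in replacement possible.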

## Architecture

### Dask DataFrame

Dask DataFrame has row-based partitioning, similar to Spark. This can be seen in the Dask [documentation](http://docs.dask.org/en/latest/dataframe.html#design). Dask also has a custom index object for indexing into the DataFrame, which is not pandas compatible. Dask DataFrame seems to treat operations on the DataFrame as MapReduce operations, which is a good paradigm for the subset of the pandas API they have chosen to implement.
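
As a rough illustration of the MapReduce pattern over row-based partitions, the sketch below computes a mean from per-partition partial results. The helper names (`partition_rows`, `map_partial`, `combine`) are assumptions made for this example and do not come from Dask.

```python
from functools import reduce
from typing import List, Tuple


def partition_rows(rows: List[float], n_parts: int) -> List[List[float]]:
    """Split rows into contiguous, row-based partitions."""
    size = -(-len(rows) // n_parts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def map_partial(part: List[float]) -> Tuple[float, int]:
    # Map step: each partition independently produces (sum, count).
    return (sum(part), len(part))


def combine(a: Tuple[float, int], b: Tuple[float, int]) -> Tuple[float, int]:
    # Reduce step: partial results merge associatively.
    return (a[0] + b[0], a[1] + b[1])


rows = [1.0, 2.0, 3.0, 4.0, 5.0]
parts = partition_rows(rows, 2)
total, count = reduce(combine, map(map_partial, parts))
mean = total / count
```

A mean decomposes cleanly into per-partition partials. An exact median does not, because the medians of the partitions cannot simply be combined into the median of the whole.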

### Modin

Modin is more of a column store, a design we inherited from modern database systems. We laterally partition the columns for scalability (many systems, such as Google BigTable, already did this), so we can scale in both directions and have finer-grained partitioning. This is explained at a high level in [Modin's documentation](https://modin.readthedocs.io/en/latest/architecture.html). Because we have this finer-grained control over the partitioning, we can support a number of operations that are very challenging in MapReduce systems (e.g. transpose, median, quantile).
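
To show why partitioning along both axes helps with an operation like transpose, here is a small sketch on nested lists. The function names are hypothetical and this is not Modin's code. It relies on the standard block-transpose identity: transpose each block independently, then swap the block's grid position from (i, j) to (j, i).

```python
from typing import List

Matrix = List[List[int]]
Grid = List[List[Matrix]]


def split_blocks(m: Matrix, row_step: int, col_step: int) -> Grid:
    """Cut a matrix into a grid of blocks along both rows and columns."""
    return [
        [[row[c:c + col_step] for row in m[r:r + row_step]]
         for c in range(0, len(m[0]), col_step)]
        for r in range(0, len(m), row_step)
    ]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def transpose_blocks(grid: Grid) -> Grid:
    """Transpose each block on its own, then move block (i, j) to (j, i)."""
    return [[transpose(grid[i][j]) for i in range(len(grid))]
            for j in range(len(grid[0]))]


def join_blocks(grid: Grid) -> Matrix:
    """Stitch a grid of blocks back into a single matrix."""
    out = []
    for block_row in grid:
        for parts in zip(*block_row):
            out.append([x for part in parts for x in part])
    return out


m = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
grid = split_blocks(m, row_step=2, col_step=2)
transposed = join_blocks(transpose_blocks(grid))
```

Each block is transposed without looking at any other block, so the per-block work can run in parallel. With row-only partitions, every output row would need values from every input partition.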

## Modin aims

In the long term, Modin is planned to become a DataFrame library that supports the popular APIs (SQL, pandas, etc.) and runs on a variety of compute engines and backends. In fact, a group has already contributed a dask.delayed backend to Modin in fewer than 200 lines of code ([PR](https://github.com/modin-project/modin/pull/281)).

- Reference: [Query: What is the difference between Dask and Modin? #515](https://github.com/modin-project/modin/issues/515)
