The idea of the Spark DataFrame may have been inspired by the DataFrame in pandas, a Python package for structured data processing. In my opinion, DataFrames are likely to be preferred by people with a BI (business intelligence) background because of the high development efficiency they offer.
A DataFrame in Spark can be registered as something roughly equivalent to a virtual table, so anyone with SQL experience can explore the data at very little cost in time.
This article focuses on DataFrame processing methods that do not rely on registering a virtual table and executing SQL. The corresponding SQL operations, such as SELECT, WHERE, GROUP BY, MIN, MAX, COUNT, SUM, DISTINCT, ORDER BY, DESC/ASC, JOIN and top-N per group (GROUP BY TOP), are supplied alongside for a better understanding of DataFrames in Spark.
First, we create a DataFrame object named a by reading a JSON file.
The content of people.json is as follows.
Let us imagine a as a table stored in a relational database such as MySQL.
The three methods above are equivalent.
The result is:
The result is:
First, we create another DataFrame, c, based on a.
Now we try to join a and c.
The corresponding DataFrame form is:
What if the records whose c.age is null should be excluded?
The na.drop method provides this functionality.
A window operation can help here.
what’s more, it is clearly
SELECT * in SQL can be implemented by:
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.
What are the implications?
MLlib will still support the RDD-based API in spark.mllib with bug fixes. MLlib will not add new features to the RDD-based API. In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API. After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated. The RDD-based API is expected to be removed in Spark 3.0.
Why is MLlib switching to the DataFrame-based API?
DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages. The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages. DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.