Spark
2016-12-15
2016-12-09
The idea of the Spark DataFrame may have been inspired by the DataFrame of pandas, a Python package for structured data processing. In my opinion, DataFrames are likely to be preferred by people with a BI (business intelligence) background because of their high development efficiency.
A DataFrame in Spark can be registered as something that behaves approximately like a virtual table, so anyone with SQL experience can explore the data at a low cost in time.
This article focuses on DataFrame processing methods that do not need a registered virtual table or executed SQL. The corresponding SQL operations, such as SELECT, WHERE, GROUP BY, MIN, MAX, COUNT, SUM, DISTINCT, ORDER BY, DESC/ASC, JOIN, and top-N per group, are supplied for a better understanding of DataFrames in Spark.
prepare test data
First, we create a DataFrame object a by reading a JSON file
and the content of people.json is as follows
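A minimal sketch of this setup, assuming a local SparkSession and assuming people.json holds the three records from the example file bundled with Spark (the real file contents may differ). The snippet writes those records to a temporary file so that it runs on its own.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("people-json").getOrCreate()

// Assumed contents of people.json: one JSON object per line.
val json =
  """{"name":"Michael"}
    |{"name":"Andy", "age":30}
    |{"name":"Justin", "age":19}""".stripMargin

// Write the assumed records to a temporary file, then read it back.
val path = Files.createTempFile("people", ".json")
Files.write(path, json.getBytes(StandardCharsets.UTF_8))

// Spark infers the schema from the JSON records; a missing age becomes null.
val a = spark.read.json(path.toString)
a.show()
```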
Let us imagine a as a table stored in a relational database such as MySQL.
desc
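One plausible DataFrame counterpart of MySQL's DESC is printSchema, which lists column names and types. The sketch below assumes a local SparkSession and rebuilds the assumed sample rows inline.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("desc-sketch").getOrCreate()
import spark.implicits._

// Assumed sample rows; None stands for a missing age.
val a = Seq[(String, Option[Long])](
  ("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L))
).toDF("name", "age")

// SQL: DESC a
a.printSchema()

// The same information is available programmatically.
val fieldNames = a.schema.fieldNames.toSeq
```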
SELECT
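Three column-selection forms that the Scala API treats as equivalent are sketched here; they are an assumption about which forms are meant. The SparkSession and the sample rows are also assumed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("select-sketch").getOrCreate()
import spark.implicits._ // enables the $"col" syntax

val a = Seq[(String, Option[Long])](
  ("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L))
).toDF("name", "age")

// SQL: SELECT name FROM a
val byString = a.select("name")    // column name as a plain string
val byDollar = a.select($"name")   // Column built with the $ interpolator
val byApply  = a.select(a("name")) // Column looked up on the DataFrame itself

byString.show()
```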
The three methods above are equivalent.
WHERE
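A minimal sketch of row filtering, assuming the same sample rows and an arbitrary threshold of 20. Both where and its alias filter accept either a Column expression or a SQL string.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("where-sketch").getOrCreate()
import spark.implicits._

val a = Seq[(String, Option[Long])](
  ("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L))
).toDF("name", "age")

// SQL: SELECT * FROM a WHERE age > 20
val byColumn = a.where($"age" > 20)
val byString = a.where("age > 20")

// Rows with a null age never satisfy the comparison, so they are dropped too.
byColumn.show()
```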
MIN,MAX,SUM,COUNT
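These aggregates can be sketched with agg and the helpers in org.apache.spark.sql.functions. The sample rows are assumed, and the aggregation runs over the whole DataFrame rather than per group.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, max, min, sum}

val spark = SparkSession.builder().master("local[*]").appName("agg-sketch").getOrCreate()
import spark.implicits._

val a = Seq[(String, Option[Long])](
  ("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L))
).toDF("name", "age")

// SQL: SELECT MIN(age), MAX(age), SUM(age), COUNT(age) FROM a
val stats = a.agg(min($"age"), max($"age"), sum($"age"), count($"age"))
stats.show()

// All four aggregates skip null ages.
val row = stats.head()
```

For per-group results, the same agg call can follow a.groupBy(...) on a grouping column.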
and the result is
COUNT DISTINCT
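A sketch of two ways to count distinct values, assuming the same sample rows. They differ in how nulls are treated: countDistinct ignores them, while distinct() keeps null as a value of its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder().master("local[*]").appName("distinct-sketch").getOrCreate()
import spark.implicits._

val a = Seq[(String, Option[Long])](
  ("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L))
).toDF("name", "age")

// SQL: SELECT COUNT(DISTINCT age) FROM a   (nulls are not counted)
val distinctAges = a.agg(countDistinct($"age")).head().getLong(0)

// SQL: SELECT COUNT(*) FROM (SELECT DISTINCT age FROM a) t   (null is one row)
val distinctRows = a.select("age").distinct().count()
```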
and the result is
ORDERBY desc
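Sorting can be sketched with orderBy and the desc / asc methods on Column, again on the assumed sample rows.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("orderby-sketch").getOrCreate()
import spark.implicits._

val a = Seq[(String, Option[Long])](
  ("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L))
).toDF("name", "age")

// SQL: SELECT * FROM a ORDER BY age DESC
val oldestFirst = a.orderBy($"age".desc)

// SQL: SELECT * FROM a ORDER BY name ASC
val byName = a.orderBy($"name".asc)

oldestFirst.show()
```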
inner join, left outer join and convert null to a default value
First we make another DataFrame, c, based on a.
Now we try to join a and c.
The corresponding DataFrame form is
What if the records whose c.age is null should be excluded?
The na.drop method provides this.
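The join steps can be sketched as follows. How c is derived from a is an assumption (rows with age over 20, with age renamed to the hypothetical column c_age), as are the sample rows and the default value 0.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

val a = Seq[(String, Option[Long])](
  ("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L))
).toDF("name", "age")

// Hypothetical second DataFrame derived from a.
val c = a.where($"age" > 20).select($"name", $"age".as("c_age"))

// SQL: SELECT ... FROM a JOIN c USING (name)
val inner = a.join(c, Seq("name"))

// SQL: SELECT ... FROM a LEFT OUTER JOIN c USING (name)
val left = a.join(c, Seq("name"), "left_outer")

// Replace null c_age with a default value (0 is an arbitrary choice).
val filled = left.na.fill(Map("c_age" -> 0))

// Drop the rows whose c_age is null instead.
val dropped = left.na.drop(Seq("c_age"))
```

After the left outer join, na.drop restricted to c_age leaves the same names as the inner join.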
Top N for group
A window operation can help here.
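A minimal sketch of top-N per group with row_number over a window. The city column, the sample rows, and N = 2 are all hypothetical, chosen only to give the window something to partition by.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().master("local[*]").appName("topn-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data with a grouping column.
val people = Seq(
  ("Andy", "NY", 30), ("Justin", "NY", 19), ("Berta", "NY", 25),
  ("Michael", "LA", 41), ("Ann", "LA", 35)
).toDF("name", "city", "age")

// Rank rows inside each city, oldest first.
val w = Window.partitionBy($"city").orderBy($"age".desc)

// SQL: SELECT * FROM (
//        SELECT *, ROW_NUMBER() OVER (PARTITION BY city ORDER BY age DESC) AS rn
//        FROM people) t
//      WHERE rn <= 2
val top2 = people
  .select($"*", row_number().over(w).as("rn"))
  .where($"rn" <= 2)

top2.show()
```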
What's more, select * in SQL can be implemented with select($"*").
2016-12-05
2016-11-28
2016-08-19
Spark 2.0 MLlib Introduction
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.
What are the implications?
MLlib will still support the RDD-based API in spark.mllib with bug fixes.
MLlib will not add new features to the RDD-based API.
In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
The RDD-based API is expected to be removed in Spark 3.0.
Why is MLlib switching to the DataFrame-based API?
DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.
2016-08-19
How to save a Spark DataFrame as a partitioned Hive table
Use the saveAsTable method
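A hedged sketch of this approach: partitionBy on the DataFrameWriter followed by saveAsTable. The table name people_by_city, the city partition column, and the sample rows are hypothetical. enableHiveSupport() requires the spark-hive module on the classpath, and depending on the Spark version the result may be stored as a Spark data source table registered in the Hive metastore rather than in a native Hive format.

```scala
import org.apache.spark.sql.SparkSession

// Hive support must be enabled for the table to land in the Hive metastore.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("save-partitioned-sketch")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Hypothetical data; city is the partition column.
val people = Seq(
  ("Andy", "NY", 30), ("Justin", "NY", 19), ("Michael", "LA", 41)
).toDF("name", "city", "age")

people.write
  .partitionBy("city")           // one directory per distinct city value
  .mode("overwrite")             // replace the table if it already exists
  .saveAsTable("people_by_city") // hypothetical table name

val saved = spark.table("people_by_city")
saved.show()
```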
2016-08-11
2016-04-25
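Spark's transformations can be divided into "narrow dependencies" and "wide dependencies" according to whether they depend on all partitions of the parent RDDs. Put simply, a transformation that involves a shuffle is a wide dependency, and one without a shuffle is a narrow dependency.
For narrow dependencies, Spark tries to place them in the same stage, whereas a wide dependency starts a new stage.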
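The distinction can be observed on a small RDD, as in this sketch (the word list and partition count are arbitrary). map yields a narrow dependency on its parent, while reduceByKey introduces a shuffle dependency, which is where the stage boundary falls.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dependency-sketch").getOrCreate()
val sc = spark.sparkContext

val words  = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
val pairs  = words.map(w => (w, 1))   // narrow: each output partition reads one parent partition
val counts = pairs.reduceByKey(_ + _) // wide: records are shuffled by key

val mapIsNarrow  = pairs.dependencies.head.isInstanceOf[NarrowDependency[_]]
val reduceIsWide = counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]

// The indentation in the lineage printout marks the shuffle (stage) boundary.
println(counts.toDebugString)
```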