Suppose I have a relation (alias) transactions with this data:
person store spent
A S 3.3
A S 4.7
B S 1.2
B T 3.4
I want to know how many distinct people visited each store and how much they spent there in total:
store visitors revenue
S 2 9.2
T 1 3.4
I was hoping I could do it in one step:
stores = foreach (group transactions by store) generate
group as store, SUM(transactions.spent) as revenue,
COUNT(UNIQUE(transactions.person)) as visitors;
But there does not seem to be anything like UNIQUE.
Am I stuck with a two-step process?
tr1 = foreach (group transactions by (store,person)) generate
    group.store as store, SUM(transactions.spent) as revenue;
stores = foreach (group tr1 by store) generate
    group as store, COUNT(tr1) as visitors, SUM(tr1.revenue) as revenue;
Answer:
There are two ways to do this.
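1) Use the built-in Distinct UDF (not the DISTINCT Pig operator). Sorry, I don't have a code sample, and I don't know how well it would perform.
2) Use a nested foreach with the DISTINCT operator, like this: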
stores = FOREACH (GROUP transactions BY store) {
    -- de-duplicate the people seen within this store's group
    uniqueVisitors = DISTINCT transactions.person;
    GENERATE
        group AS store,
        COUNT(uniqueVisitors) AS visitors,
        SUM(transactions.spent) AS revenue;
}
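The advantage of the second approach is that it should not disable the combiner. According to the Pig documentation, the combiner can still be used with a nested foreach as long as DISTINCT is the only nested operation: http://pig.apache.org/docs/r0.11.1/perf.html#When+the+Combiner+is+Used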
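To check the combiner claim on your own setup, one option is to run the nested-DISTINCT version end to end and inspect the plan. The following is only a sketch: the file name transactions.tsv and the column types are assumptions made for illustration (the question does not say how the data is loaded), and the plan output can differ between Pig versions.

```pig
-- Assumption: transactions.tsv is a hypothetical tab-separated file
-- holding the sample rows (person, store, spent) from the question.
transactions = LOAD 'transactions.tsv' USING PigStorage('\t')
    AS (person:chararray, store:chararray, spent:double);

stores = FOREACH (GROUP transactions BY store) {
    -- DISTINCT is the only nested operation in this block
    uniqueVisitors = DISTINCT transactions.person;
    GENERATE
        group AS store,
        COUNT(uniqueVisitors) AS visitors,
        SUM(transactions.spent) AS revenue;
}

-- Print the logical, physical and MapReduce plans for the alias
EXPLAIN stores;

-- Run the job and print the per-store results
DUMP stores;
```

If the combiner is being used, the MapReduce plan printed by EXPLAIN would be expected to show a non-empty combine stage. If it does not, that suggests something in the script is preventing the optimization.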

