apache-pig: counting distinct elements in a bag

2025-05-04 · 105 · birdshome

Suppose I have an alias transactions with this data:

person  store  spent 
A       S      3.3 
A       S      4.7 
B       S      1.2 
B       T      3.4 

I want to know how many distinct people visited each store and how much was spent there in total:

store   visitors  revenue 
S       2         9.2 
T       1         3.4 

I was hoping I could do this in a single step:

stores = foreach (group transactions by store) generate 
  group as store, SUM(transactions.spent) as revenue,  
  COUNT(UNIQUE(transactions.person)) as visitors; 

But there does not appear to be anything like UNIQUE.

Am I stuck with a two-step process?

tr1 = foreach (group transactions by (store,person)) generate 
  group.store as store, SUM(transactions.spent) as revenue; 
stores = foreach (group tr1 by store) generate 
  group as store, COUNT(tr1) as visitors, SUM(tr1.revenue) as revenue; 

Please refer to the following answer:

There are two ways to do this.

1) Use the built-in Distinct UDF (not the DISTINCT Pig operator). Sorry, I don't have a code example, and I don't know how it would perform.

2) Use a nested foreach with the DISTINCT operator, like this:

stores = FOREACH (GROUP transactions BY store) { 
    -- project the person column of the grouped bag, then de-duplicate it 
    uniqueVisitors = DISTINCT transactions.person; 
    GENERATE 
        group AS store, 
        COUNT(uniqueVisitors) AS visitors, 
        SUM(transactions.spent) AS revenue; 
} 

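The advantage of the second approach is that it should not disable the combiner. According to the Pig performance documentation, the combiner is still used with a nested foreach as long as DISTINCT is the only nested operation: http://pig.apache.org/docs/r0.11.1/perf.html#When+the+Combiner+is+Used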
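
To try the nested foreach approach end to end, here is a minimal sketch. It assumes the four sample rows are stored in a tab-separated file; the file name transactions.tsv and the column types are assumptions for illustration, not part of the original question.

```pig
-- Assumption: 'transactions.tsv' is a hypothetical tab-separated file
-- holding the sample rows (person, store, spent) from the question.
transactions = LOAD 'transactions.tsv' USING PigStorage('\t')
    AS (person:chararray, store:chararray, spent:double);

stores = FOREACH (GROUP transactions BY store) {
    -- Inside the nested block, 'transactions' is the bag of rows for one store.
    -- Project the person column and remove duplicates.
    uniqueVisitors = DISTINCT transactions.person;
    GENERATE
        group AS store,
        COUNT(uniqueVisitors) AS visitors,
        SUM(transactions.spent) AS revenue;
}

-- Print one (store, visitors, revenue) tuple per store.
DUMP stores;
```

If it behaves as intended, the result should correspond to the store / visitors / revenue table in the question. Running it in local mode (pig -x local) is usually enough for a dataset this small.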