[Cialug] a bit OT: working with big data

Tim Champion timchampion at gmail.com
Thu Nov 15 11:16:21 CST 2012


If you want to group the 4 years together, it would be something like this:
ROUND(year / 4) as every_four

Depending on what you want your year ranges to be, you might have to shift
that by adding or subtracting from the year.
ROUND((year + 2) /4) as every_four
or something like that.

Tim Champion
timchampion at gmail.com


On Thu, Nov 15, 2012 at 10:53 AM, Tim Champion <timchampion at gmail.com>wrote:

> I know nothing about this MapReduce framework, but I can help with SQL.
>
> For your "every 4 years" requirement, you could use the modulo operator
> MOD like so:
> (instead of substr(year, 0, 3) as decade, make it: year MOD 4 as every_four
> then group by every_four instead of decade.
>
> Is that what you are looking for? or do you want 4 year blocks grouped
> together? 2001-2004, 2005-2008, etc.
>
> I'm not clear on what you want to do for your state table. What do you
> want that table structure to look like specifically? In general, that would
> likely group by state, but not sure what other fields you're looking for.
>
> Tim Champion
> timchampion at gmail.com
>
>
>
> On Wed, Nov 14, 2012 at 8:00 PM, Brett Neese <brneese at brneese.com> wrote:
>
>> Hey all,
>>
>> Does anyone in this group have experience working with Hive/Hadoop
>> MapReduce frameworks?
>>
>> I'm currently working with this how-to
>> <http://aws.amazon.com/articles/5249664154115844>and associated data-set
>> and am having some troubles trying to figure out how to structure my
>> queries.
>>
>> My needs diverge around step "Word Ratio By Decade" -- namely, a) I would
>> not like to limit to decade marks when I create my new table, instead, I
>> would prefer to limit to every four years and b) Instead of using the
>> entire corpus, I would like to create a table that is limited to the names
>> of the states (for instance "Iowa" and "New York").  The current query is
>>
>>
>> INSERT OVERWRITE TABLE by_decade
>> SELECT
>>  a.gram,
>>  b.decade,
>>  sum(a.occurrences) / b.total
>> FROM
>>  normalized a
>> JOIN (
>>  SELECT
>>   substr(year, 0, 3) as decade,
>>   sum(occurrences) as total
>>  FROM
>>   normalized
>>  GROUP BY
>>   substr(year, 0, 3)
>> ) b
>> ON
>>  substr(a.year, 0, 3) = b.decade
>> GROUP BY
>>  a.gram,
>>  b.decade,
>>  b.total;
>>
>> And I have very limited SQL skills and have no idea how to go about
>> adapting that. I do know I will need to manually hand code in a WHERE
>> clause to limit to the names of the states.
>>
>> I would then like to take that raw data and export it out of Hadoop for
>> further manipulation.
>>
>> Can anyone help me here? I greatly appreciate it.
>>
>> --
>> Brett Neese
>> 563-210-3459
>> http://brneese.com
>> _______________________________________________
>> Cialug mailing list
>> Cialug at cialug.org
>> http://cialug.org/mailman/listinfo/cialug
>>
>
>


More information about the Cialug mailing list