Sqoop daily Oracle data into Hive table partition

The following bash script can be used to import Oracle records into a Hive table, partitioned by date. It uses Sqoop. Both Hive and Sqoop are part of typical Hadoop distributions, like the Hortonworks Sandbox, for example.

#!/bin/sh

function upper() {
  echo "$1" | tr [a-z] [A-Z]
}

if [ $# -ge 5 ]; then
  schema=$(upper $1)
  table=$(upper $2)
  column_to_split_by=$(upper $3)
  date_column=$(upper $4)
  date_value="$5"
else 
  echo
  echo "Usage: $(basename $0) schema table column-to-split-by date-column YYYY-MM-DD"
  echo
  echo "Imports all records where value of date-column is \$date_value from"
  echo "Oracle table \$schema.\$table as a Hive table partition."
  echo "Hadoop will split the import job based on the column-to-split-by."
  echo "* The table must have the columns specified as column-to-split-by and date-column."
  echo "* The column-to-split-by must be finer granularity than date-column, ideally unique."
  echo "* The date_value must be in YYYY-MM-DD format."
  echo "* If date_value is unspecified, the current date will be used."
  exit 1
fi

echo "schema = $schema"
echo "table = $table"
echo "column_to_split_by = $column_to_split_by"
echo "date_column = $date_column"
echo "date_value = $date_value"

# we have to drop the partition, because --hive-overwrite does not seem to do it
hive -e "use $schema; alter table $table drop if exists partition($date_column='$date_value');"

columns=$( \
sqoop eval \
--options-file /usr/local/etc/sqoop-options.txt \
--query "select column_name from all_tab_columns where table_name = '$table'" \
| tr -d " |" \
| grep -Ev "\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-|COLUMN_NAME|$date_column" \
| tr '\n' ',' \
| sed -e 's/\,$//'
)

query="select $columns from $schema.$table \
       where $date_column = to_date('$date_value', 'YYYY-MM-DD') \
       and \$CONDITIONS"

echo "query = $query"

sqoop import \
--options-file "/usr/local/etc/sqoop-options.txt" \
--query "$query" \
--split-by "$column_to_split_by" \
--target-dir "$schema.$table" \
--hive-import \
--hive-overwrite \
--hive-table "$schema.$table" \
--hive-partition-key "$date_column" \
--hive-partition-value "$date_value" \
--outdir $HOME/java

JDBC connection details

Put them into /usr/local/etc/sqoop-options.txt, in a format like this:

--connect
jdbc:oracle:thin:@hostname:port:hostname
--username
oracle_username
--password
oracle_password

Reflexionen der Moderne im dramatischen Werk Ernst Tollers

About 12 years ago, in July 2001, I submitted my thesis “Zwischen Weltverbesserung und Isolation – Reflexionen der Moderne im dramatischen Werk Ernst Tollers” to complete my university degree in Mathematics and German Linguistics and Literature.

I wrote the document using Latex and GNU Emacs on a GNU/Linux system. It is available in PDF format.

The LaTex source files of the thesis are also available. The structure is very straightforward and uses predefined macro definitions that can be generally useful for writing essays, books and academic papers in the Liberal Arts.

For those who are fed up with their word processor messing with their layouts and prefer to just write plain text: Take a look at how simple, for example my introduction chapter is.

If you are interested, feel free to reuse my LaTeX macros in STYLE.tex.

The LaTeX code is compatible with TeX Live, version 2012. On Debian stable (wheezy) installation is as simple as

sudo apt-get install texlive texlive-latex-extra texlive-lang-german evince
wget https://github.com/odoepner/toller-moderne/archive/master.zip
unzip master.zip
cd toller-moderne-master/src/main/tex
pdflatex MAIN.tex
evince MAIN.pdf

Manage “recommended” dependencies with apt-get, debfoster and custom script

I use apt-get to install Debian packages. By default apt-get will also install all packages that your desired package depends on or that it recommends.

I find that recommended dependencies are often not actually necessary. I use debfoster to carefully review and selectively remove them, to keep my system light and clean. My approach requires this line in /etc/debfoster.conf:

UseRecommends = no

With this setting, debfoster will ignore recommended dependencies and allow you to decide individually if you want to keep them.

Disclaimer

This approach only makes sense if you know exactly what you are doing. Sometimes the removal of “recommended” dependencies can actually break functionality. If in doubt, tell debfoster to keep (Y) the respective packages or skip (s) the decision.

The prune (p) option offered by debfoster is the most drastic removal type and should be used with extreme caution.

Reinstall recommended dependencies

The ant-rdepends command can help you find out which packages recommend a given package (replace PACKAGE with the name of the package you are interested in):

sudo apt-get install apt-rdepends
apt-rdepends -pr --state-show Installed --state-follow Installed --show Recommends PACKAGE

If you ever remove too much, you can reinstall all dependencies (including the recommended ones) of a package using the following script. Save it for example as /usr/local/bin/install-dependecies.sh and use chmod ugo+x to make it executable.

#!/bin/sh

if [ $# -ne 1 ]; then
  echo "Usage: $(basename $0) package"
  exit 1
fi

package="$1"
header="Package $package depends on:"

df_output=$(debfoster -o UseRecommends=true -d $package)
pkg_list=${df_output#*$header}

if [ "$pkg_list" != "$df_output" ]; then
  sudo apt-get install $pkg_list
else
  echo $df_output
fi

Reference info

Install cygwin and cygwinports packages using apt-cyg

I recently started using apt-cyg and like it a lot. It allows easy installation of packages from Cygwin and Cygwin ports repositories.

It is inspired by Debian’s apt but it does not support anything like a sources.list and it can actually only work against one repo (mirror) at a time. Well, I guess that’s bearable since there are really only two sources (main cygwin and cygwin ports). But I might try apt-cyg-multi which claims to have added the missing cross-repository functionality.

For now, I found it convenient to configure these bash aliases (use a suitable Cygwin mirror and Cygwin Ports mirror for your location):

alias cyg='apt-cyg -m http://mirrors.kernel.org/sources.redhat.com/cygwin/'
alias cyp='apt-cyg -m http://mirrors.kernel.org/sources.redhat.com/cygwinports/'

Update, June 2013: There is a promising-looking apt-cyg fork on github. I just started using it.

Save video stream as file using rtmpdump (even on Windows)

The rtmpdump tool can help you save video streams as local files.

It comes with most Linux distributions, e.g. on Ubuntu or Debian (with sudo):
sudo apt-get install rtmpdump

On Windows, the most convenient way is probably via Cygwin and Cygwin Ports:

  1. Install Cygwin
  2. Install Cygwin Ports
  3. On the package selection screen, select “rtmpdump”

Read the manual page and study the options:
man rtmpdump

Common entries in .bash_aliases

I keep my bash aliases grouped in separate files. I include them from .bashrc like this:

# Alias definitions that should work everywhere
if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi

# Alias definitions that are specific to your distro
# e.g. Cygwin-only stuff, apt-get shortcuts on Debian, etc.
if [ -f ~/.bash_aliases.distro-specific ]; then
    . ~/.bash_aliases.distro-specific
fi

This is the ~/.bash_aliases that I commonly use:

# Interactive verbose operation...
alias rm='rm -iv'
alias cp='cp -iv'
alias mv='mv -iv'

# Default to human readable figures
alias df='df -h'
alias du='du -h'

# Misc :)
alias less='less -r'                          # raw control characters
alias whence='type -a'                        # where, of a sort
alias grep='grep --color'                     # show differences in colour
alias egrep='egrep --color=auto'              # show differences in colour
alias fgrep='fgrep --color=auto'              # show differences in colour

# Some shortcuts for different directory listings
alias ls='ls -hF --color=tty'                 # classify files in colour
alias dir='ls --color=auto --format=vertical'
alias vdir='ls --color=auto --format=long'
alias ll='ls -l'                              # long list
alias la='ls -A'                              # all but . and ..
alias l='ls -CF'                              #

This is the ~/.bash_aliases.distro-specific that I use on Debian (or other APT-based distros):

alias show='apt-cache show'
alias search='apt-cache search'

alias files='dpkg -L'
alias selections='dpkg --get-selections'

alias install='sudo apt-get install'
alias reinstall='sudo apt-get install --reinstall'

alias update='sudo apt-get update'
alias upgrade='sudo apt-get upgrade'

alias remove='sudo apt-get remove'
alias purge='sudo apt-get remove --purge'
alias autoremove='sudo apt-get autoremove'