Early Data Analysis with Pandas

How can you get a quick overview of what a dataset contains?

  • df.describe() returns summary statistics (count, mean, std, min, quartiles, max) for each numeric column
  • df.head() returns the first rows of the DataFrame (five by default)
  • df.info() prints a summary of the DataFrame: column data types, non-null counts, memory usage and so on
  • df.count() returns a Series with the number of non-NA/null values in each column
  • df['column'].value_counts() returns the count of each unique value in a column
  • df['column'].nunique() returns the number of unique values in a column
  • pandas.plotting.scatter_matrix(df) draws a matrix of pairwise scatter plots for the given DataFrame (the function lived in pandas.tools.plotting in older pandas releases)
  • df['column'].hist() draws a histogram of the column values using matplotlib
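  • scipy.stats.probplot(array, plot=plt) draws a probability plot of the data against normal quantiles (the default) to check whether the data are approximately normal
  • statsmodels.graphics.gofplots.qqplot(array, line='s') draws a Q-Q plot; line='s' adds a standardized reference line
  • scipy.stats.shapiro(array)[1] returns the p-value of the Shapiro-Wilk test for normality; a small p-value is evidence against normality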
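
The non-plotting calls from the list can be tried together in a minimal sketch like the one below. The DataFrame, its column names ("height", "group") and the random data are made up for illustration and are not from any real dataset. The plotting calls are left out so that the example runs without a display.

```python
import io

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample data: 200 normally distributed heights plus a categorical column.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "height": rng.normal(loc=170.0, scale=10.0, size=200),
    "group": rng.choice(["a", "b", "c"], size=200),
})
# Blank out the first five heights so that count() differs from len(df).
df.loc[:4, "height"] = np.nan

first_rows = df.head()                      # first five rows
summary = df.describe()                     # statistics for numeric columns only
non_null = df.count()                       # non-NA values per column
group_counts = df["group"].value_counts()   # rows per unique value
n_groups = df["group"].nunique()            # number of unique values

# info() prints its report and returns None; pass buf= to capture the text.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

# Shapiro-Wilk test on the non-missing heights.
heights = df["height"].dropna().to_numpy()
p_value = stats.shapiro(heights)[1]

# Without plot=..., probplot only returns the quantile pairs and the fitted line.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(heights)

print(summary)
print(non_null)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")
```

To see the probability plot itself, pass plot=plt (with matplotlib.pyplot imported as plt) to probplot and call plt.show() afterwards.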