Collecting Data

Creating Structured Data Sets for Visualization

These tips will help you collect and generate data sets that you can analyze and visualize with a wide variety of existing and future software tools and media formats. They also apply to collecting geolocation data for mapping.

  • Write your notes using a plain text editor such as TextEdit, or maintain a backup set of your field notes in a plain text format (e.g. .TXT files). Keeping an archive of your notes in plain text will let you cleanly import your field notes, or copy and paste selected notes, into specialized software, and will keep them usable in future versions of writing software.
  • Create or save structured data from spreadsheets in plain formats such as .TSV (tab-separated values) or .CSV (comma-separated values). The formatting issues described above for word processing software apply to structured and tabulated data, too. Exporting and saving your data as .TSV or .CSV files will therefore keep it accessible to a wider variety of existing and future visualization tools.
  • Use a consistent and cogent set of categories and units to describe your data whether you are collecting measurable or descriptive data.
  • Ensure that column names are the same for the same types of data if you are using multiple worksheets, workbooks or data sets. For example, if “year of birth,” “birthday,” and “year” all label the same type of data, use a single term for that column in all your data sets. This consistency is especially important for joining and analyzing data sets that are derived from multiple sources or time periods.
  • Avoid empty spaces in the column headers that describe your data, because some software requires headers without spaces. For example, “age (months)” might become “age_months”, and “birthday” could be “birthdate”.
  • Review and clean up tabular data. Whether you collect them or generate them yourself, spreadsheet tables may not be ready for accurate visual analysis. Despite the neat gridded layout, data gets messy! Check for errors, obvious outliers, typos and erroneous empty (null) rows. Ensure that each column is formatted as the data type that corresponds to how you will use its data. For example, check that dates and currency are defined as such, and change numbers to text in columns whose numbers are not measurable quantities to be computed, such as zip codes or medical codes. Remove any pre-aggregated data that is not part of the raw data itself, such as total or sub-total rows that contain sums, averages, counts, etc. Remove introductory text such as titles or legends that appear apart from your column headers, and flatten any sub-headers by creating new columns for the major headers in the hierarchy. Finally, remove blank rows; check where white space appears in your headers and data; and trim leading and trailing white space and collapse consecutive spaces. OpenRefine is an excellent free tool for managing these issues and for cleaning and organizing data sets before you import them into visualization or mapping software.
  • Keep a record of your data sources, including when each data set was last collected, edited or published, and when you accessed or generated it.
  • You can extract tabulated data from PDFs and save it as CSV tables using Tabula.
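
Several of the clean-up tips above (headers without spaces, trimmed and collapsed white space, no blank rows, codes kept as text, plain .CSV output) can also be scripted. The following is only a minimal sketch using Python's standard library, not a replacement for OpenRefine. The function names and the sample table are invented for illustration, and the header rule (lowercase, underscores) is one possible convention among many.

```python
import csv
import io
import re


def normalize_header(name):
    """Turn a header such as 'Age (months)' into 'age_months'."""
    name = name.strip().lower()
    # Replace each run of characters other than a-z and 0-9 with one underscore.
    name = re.sub(r"[^0-9a-z]+", "_", name)
    return name.strip("_")


def clean_cell(value):
    """Collapse consecutive white space and trim both ends."""
    return re.sub(r"\s+", " ", value).strip()


def clean_table(raw_text):
    """Return (headers, rows) with tidy headers, tidy cells and no blank rows."""
    reader = csv.reader(io.StringIO(raw_text))
    headers = [normalize_header(h) for h in next(reader)]
    rows = []
    for record in reader:
        cells = [clean_cell(c) for c in record]
        if not any(cells):  # skip rows that are empty or only white space
            continue
        rows.append(dict(zip(headers, cells)))
    return headers, rows


def to_csv(headers, rows):
    """Serialize the cleaned rows as plain CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=headers, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample data, made up for this example.
RAW = (
    "Participant Name, Age (months) ,Zip Code\n"
    "  Ana   Lopez ,14,02139\n"
    ",,\n"
    "Ben  Okafor, 20 ,10027\n"
)

headers, rows = clean_table(RAW)
print(headers)
print(to_csv(headers, rows))
```

Because the csv module reads every cell as text, a zip code such as 02139 keeps its leading zero. Columns that really are quantities, such as age_months here, can be converted with int() or float() at the point where you compute with them.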