Big Idea 2: Data
Binary Data: Bits, Bytes, and Numbers
Computers are physical devices built from circuits that are easiest to design when they only have to reliably distinguish between two states (on/off, high/low voltage, true/false). This is why nearly all digital data is represented using binary numbers, which use only two digits: 0 and 1.
A single binary digit is a bit (short for “binary digit”). A bit is the smallest unit of information stored or manipulated on a computer, and it can represent exactly one of two possibilities. Since one bit is too small for most real data, bits are grouped. A byte is a group of 8 bits and is a common unit for storage and memory. One byte can represent:
2^8 = 256
different patterns (from 00000000 to 11111111). If interpreted as an unsigned integer, that corresponds to values 0 through 255.
Understanding binary matters because nearly everything else in this unit builds on it: numbers are stored as bit patterns, text is stored as numeric character codes, images and sound are stored as sequences of numbers, and compression reduces how many bits are needed to store or transmit data.
Place value in binary (how numbers are encoded)
Binary uses the same place-value idea as decimal, but each position represents a power of 2 instead of a power of 10. For a binary number with bits b_n b_{n-1} ... b_1 b_0 (each b_i is 0 or 1), the value is:
\text{value} = \sum_{i=0}^{n} b_i \cdot 2^i
For example, binary 10110 means:
- 1\cdot 2^4 = 16
- 0\cdot 2^3 = 0
- 1\cdot 2^2 = 4
- 1\cdot 2^1 = 2
- 0\cdot 2^0 = 0
Total:
16 + 4 + 2 = 22
A common misconception is thinking binary “works differently” from decimal. It’s the same place-value concept—just a different base.
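The place-value sum above can be checked with a short Python sketch (Python is not part of AP CSP; the function name here is just illustrative):

```python
# Evaluate a binary string by summing b_i * 2^i, reading bits right to left.
def binary_value(bits):
    value = 0
    for i, b in enumerate(reversed(bits)):
        value += int(b) * 2 ** i
    return value

print(binary_value("10110"))  # 22, matching 16 + 4 + 2
# Python's built-in conversion agrees:
print(binary_value("10110") == int("10110", 2))  # True
```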
Base conversion (binary ↔ decimal)
Binary numbers are rarely used directly in everyday life, so programmers and computer scientists must be able to convert between binary (used internally by computers) and decimal (used by people). The key idea is always the same: each binary digit represents a different power of 2.
Binary to decimal (method: powers of 2)
To convert binary to decimal, add the powers of 2 for every position that contains a 1.
Example: 1101
1101_2 = 8 + 4 + 0 + 1 = 13
Decimal to binary (two common methods)
You may see either method on AP-style questions; both produce the same result.
Method A: repeated division by 2
- Divide the decimal number by 2.
- Record the remainder (0 or 1).
- Use the quotient and repeat until the quotient is 0.
- Read remainders from last to first.
Example: Convert 13 to binary.
- 13 ÷ 2 = 6 remainder 1
- 6 ÷ 2 = 3 remainder 0
- 3 ÷ 2 = 1 remainder 1
- 1 ÷ 2 = 0 remainder 1
Reading from last to first gives 1101.
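The repeated-division steps translate directly into code. A short Python sketch (the function name is illustrative):

```python
# Decimal-to-binary conversion by repeated division by 2 (Method A).
def to_binary(n):
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        remainders.append(str(n % 2))  # record the remainder (0 or 1)
        n = n // 2                     # repeat with the quotient
    return "".join(reversed(remainders))  # read remainders last to first

print(to_binary(13))  # 1101
```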
Method B: build using powers of 2 (largest power down)
- Find the largest power of 2 that is less than or equal to the decimal number.
- Subtract it.
- Repeat until you reach 0.
- Mark 1s for the powers you used and 0s for the ones you skipped.
Example: Convert 200 to binary.
- Largest power of 2 ≤ 200 is 128, remainder is 72.
- Next power 64 fits, remainder is 8.
- Next powers 32, 16 do not fit.
- Power 8 fits, remainder is 0.
So 200 = 128 + 64 + 8, which corresponds to bits for 128, 64, 32, 16, 8, 4, 2, 1:
200_{10} = 11001000_2
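Method B can also be sketched in Python (illustrative; the fixed-width parameter is an assumption, not part of the method itself):

```python
# Decimal-to-binary by subtracting the largest fitting power of 2 (Method B).
def to_binary_powers(n, width=8):
    bits = ""
    for power in [2 ** i for i in range(width - 1, -1, -1)]:
        if power <= n:
            bits += "1"  # this power of 2 is used
            n -= power
        else:
            bits += "0"  # this power of 2 is skipped
    return bits

print(to_binary_powers(200))  # 11001000
```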
How many values can you represent with a fixed number of bits?
Computers store data using a fixed number of bits, which creates limits. With n bits, you can represent exactly:
2^n
different bit patterns.
If those patterns represent unsigned (nonnegative) integers, the range is:
0 \text{ to } 2^n - 1
Example: with 5 bits, there are 2^5 = 32 patterns, so you can represent 0 through 31.
Overflow (what goes wrong and why)
Overflow happens when the true mathematical result needs more bits than the computer has available. For example, if you store values in 8 bits (0–255) and add 1 to 255, the correct answer is 256, but 256 needs 9 bits (100000000). If you only keep the lowest 8 bits, you end up with 00000000, which is 0.
On the AP CSP exam, you’re generally expected to recognize that fixed-size storage implies a maximum representable value, and exceeding it can cause incorrect results.
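The 8-bit wraparound described above can be simulated by keeping only the lowest 8 bits (taking the remainder modulo 2^8):

```python
BITS = 8
max_value = 2 ** BITS - 1        # 255: largest unsigned value in 8 bits

result = max_value + 1           # true mathematical answer: 256
stored = result % (2 ** BITS)    # keeping only the lowest 8 bits
print(stored)  # 0 — the value "wrapped around"
```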
Representing real numbers (why approximation is unavoidable)
Whole numbers are relatively straightforward in binary, but real numbers (numbers with fractional parts) are tricky. Many values cannot be represented exactly with a finite number of bits.
Computers commonly use floating-point representation (similar in spirit to scientific notation), but the key takeaway is that some decimal fractions become repeating fractions in binary. Storing a repeating fraction with finite bits forces rounding, and rounding can cause small errors that sometimes add up. For example, in base 10, 1/3 = 0.3333... repeats forever; similarly, many “simple” decimals repeat in base 2.
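A quick demonstration in Python (any language using standard floating-point hardware behaves similarly):

```python
# 0.1 and 0.2 both repeat forever in binary, so each is stored rounded,
# and the tiny rounding errors show up in the sum.
total = 0.1 + 0.2
print(total)          # 0.30000000000000004, not exactly 0.3
print(total == 0.3)   # False
```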
Exam Focus
- Typical question patterns:
- Convert between binary and decimal, or determine the value of a binary pattern.
- Determine how many values n bits can represent, or the max value 2^n - 1.
- Reason about overflow or why a computation could be inaccurate due to limited bits.
- Common mistakes:
- Mixing up “number of bits” with “largest number” (values count is 2^n; max unsigned is 2^n - 1).
- Forgetting that bit positions represent powers of 2 starting at 2^0 on the right.
- Assuming real-number computations are always exact rather than approximations.
Encoding Text: From Characters to Numbers
When you type a letter, the computer does not store the literal shape of that letter. It stores a number that stands for the character. This is what makes text searchable, editable, and transmittable across networks.
Character encoding (what it is)
A character encoding is a rulebook that maps characters (letters, digits, punctuation, emojis, characters from many languages) to numeric values. Once a character is represented as a number, it can be stored as bits.
Two major examples are ASCII (older, focused on basic English characters) and Unicode (a much larger standard designed to represent writing systems worldwide). The key understanding is not memorizing charts, but recognizing that text is stored in binary via numeric codes, and that different encodings exist.
Why encodings matter (compatibility and meaning)
If two systems don’t agree on an encoding, the same bit pattern can be interpreted as different characters. That’s why opening a file with the wrong encoding can produce garbled text.
Encodings also affect storage size. Some use more bits per character, and some use variable-length sequences, so the same message may take more space depending on what characters appear.
How text becomes bits (a conceptual pipeline)
A helpful way to think about text storage is:
- Start with characters (like “CSP”).
- Use an encoding to map each character to a number.
- Convert each number to binary.
- Store/transmit the bits.
Example: reasoning about storage size for text
If a system uses 1 byte per character (a simplified assumption often used in introductory contexts), then a 100-character message uses 100 bytes.
If an encoding uses 2 bytes per character for some characters, then those characters increase storage. A subtle misconception is believing “a character is always a byte.” In reality, storage depends on encoding.
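The characters-to-numbers-to-bits pipeline can be observed directly in Python, which exposes character codes via `ord` and encodings via `encode`:

```python
# Each character maps to a numeric code, which is stored as bits.
for ch in "CSP":
    print(ch, ord(ch), bin(ord(ch)))  # e.g. C 67 0b1000011

# Storage per character depends on the encoding:
print(len("CSP".encode("utf-8")))  # 3 — these characters fit in 1 byte each
print(len("é".encode("utf-8")))    # 2 — not every character is 1 byte
```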
Exam Focus
- Typical question patterns:
- Explain how text is represented in a computer (characters mapped to numbers, then to bits).
- Compare encodings conceptually (smaller set vs broader character support).
- Reason about storage implications when bits per character changes.
- Common mistakes:
- Treating text as “stored as letters” rather than numeric codes.
- Assuming all characters always take the same number of bits in every encoding.
- Confusing a font (how text looks) with an encoding (how text is stored).
Representing Images: Pixels, Color, and Metadata
An image on a computer is usually represented as a grid of tiny colored squares called pixels (short for “picture elements”). The computer stores the color of each pixel as bits and then uses those stored values to display colored light on a screen.
Raster images (the pixel grid idea)
Most everyday images (photos, screenshots) are raster images, made of pixels arranged in rows and columns. Two properties drive both quality and file size:
- Dimensions / resolution: how many pixels wide and tall (for example, 1920 by 1080).
- Color depth (bits per pixel): how many bits are used to store each pixel’s color.
More pixels usually means more detail, but also more data.
Black-and-white images as bits (a simple mental model)
A simple way to see image representation is with a black-and-white image where 1 means black (on) and 0 means white (off). You can imagine “drawing” the image by creating a grid and coloring squares based on the 0s and 1s.
Before you can correctly interpret those bits, you must know the grid size (how many pixels across and down). That information is part of the image’s metadata. For example, metadata might specify that an image is 10 × 10, meaning 10 pixels across and 10 pixels down; without that, the same stream of bits could be grouped into the wrong rows and show the wrong picture.
Color representation (RGB and channel ranges)
Most images are not just black and white. To represent color, computers still use binary numbers, but they store more bits per pixel. Color on screens is based on light, and many systems represent colors by mixing red, green, and blue light (RGB).
A common approach uses 8 bits per channel. That means each channel can range from 0 to 255 in decimal:
- maximum channel value 255 corresponds to binary 11111111
- minimum channel value is 0
If each channel uses 8 bits, the pixel uses 24 bits total. More bits per pixel means more possible colors and smoother gradients, but larger file sizes.
With b bits per pixel, the number of distinct colors representable is:
2^b
So with 24-bit color, there are:
2^{24}
possible colors.
Uncompressed image size (a practical formula)
Ignoring metadata and compression, an approximate uncompressed size is:
\text{bits} = \text{width} \cdot \text{height} \cdot \text{bitsPerPixel}
To convert bits to bytes:
\text{bytes} = \frac{\text{bits}}{8}
Example: A 100 by 100 image with 24 bits per pixel has:
- bits = 100 \cdot 100 \cdot 24 = 240000
- bytes = 240000 / 8 = 30000 bytes (about 30 KB)
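The same arithmetic as a short Python sketch, which also counts the distinct colors available at 24 bits per pixel:

```python
width, height, bits_per_pixel = 100, 100, 24

bits = width * height * bits_per_pixel
num_bytes = bits // 8
colors = 2 ** bits_per_pixel

print(bits)       # 240000
print(num_bytes)  # 30000
print(colors)     # 16777216 distinct 24-bit colors
```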
Metadata (data about data)
Many image files also store metadata, which is extra information about the image rather than the visible pixels. Examples include dimensions, date/time created, camera settings, and location (GPS) in some photos.
Metadata is useful for organization and functionality, but it can also raise privacy concerns (for example, a shared photo revealing where it was taken). A common misconception is to treat metadata as “not real data.” It is data, and it can be sensitive.
Exam Focus
- Typical question patterns:
- Predict what happens to file size when resolution or color depth changes.
- Explain how an image can be represented with pixels and bits.
- Reason about what metadata is (including size information like width × height) and why it matters.
- Common mistakes:
- Assuming “higher resolution” always means “better” without acknowledging larger storage and slower transmission.
- Forgetting that bits per pixel controls how many colors are representable.
- Ignoring metadata as a possible privacy risk or as required information to interpret pixel data.
Representing Sound: Sampling, Bit Depth, and File Size
Sound in the real world is continuous, but computers store discrete bits. So computers approximate sound by turning a continuous wave into a sequence of measurements.
Analog vs digital signals
An analog signal exists throughout a continuous interval of time and can take on a continuous range of values. A digital signal is a sequence of discrete symbols. When those symbols are 0s and 1s, they are bits.
Because digital signals use discrete symbols, they are not continuous in time and not continuous in their range of values. A major advantage is that digital signals are generally more resilient against noise than analog signals during storage and transmission.
Sampling (turning a continuous signal into data)
Sampling is recording an analog signal at regular discrete moments and converting those measurements into a digital signal. Each measurement records the wave’s amplitude at that moment.
The sampling rate is the number of samples taken per second (often expressed in Hz). More samples per second usually means a more accurate representation of the original sound (especially higher-frequency details), but it increases the amount of data.
Quantization and bit depth (how precise each sample is)
Each sample must be stored using a fixed number of bits. The number of bits per sample is the bit depth.
- Higher bit depth means more possible amplitude levels.
- More levels usually means less quantization error (less “graininess” or rounding noise).
With b bits per sample, you can represent:
2^b
different amplitude levels.
Sound file size (uncompressed)
Ignoring metadata and compression, a common approximation for uncompressed audio size is:
\text{bits} = \text{seconds} \cdot \text{samplesPerSecond} \cdot \text{bitsPerSample} \cdot \text{channels}
where channels is 1 for mono and 2 for stereo.
Example: 10 seconds of mono audio, 8000 samples/second, 8 bits/sample:
- bits = 10 \cdot 8000 \cdot 8 \cdot 1 = 640000
- bytes = 640000 / 8 = 80000 bytes
In many questions, you can reason directionally: increasing duration, sampling rate, bit depth, or channels increases file size.
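The size formula above in a short Python sketch, including the effect of switching from mono to stereo:

```python
seconds, samples_per_second, bits_per_sample, channels = 10, 8000, 8, 1

bits = seconds * samples_per_second * bits_per_sample * channels
print(bits)       # 640000
print(bits // 8)  # 80000 bytes

# Stereo (channels = 2) doubles the size:
print(bits * 2 // 8)  # 160000 bytes
```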
What can go wrong: under-sampling and distortion
If you sample too slowly, you lose information about the original wave, which can cause distortion or missing high-frequency content.
It’s also important not to confuse sampling with compression. Sampling converts analog sound into digital data; compression represents that digital data using fewer bits.
Exam Focus
- Typical question patterns:
- Describe how sound is represented digitally using sampling and bit depth.
- Predict how file size changes when sampling rate, bit depth, duration, or channels change.
- Compare tradeoffs: quality versus storage/transmission time.
- Common mistakes:
- Confusing sampling rate (how often) with bit depth (how precise each sample is).
- Forgetting to account for channels (mono vs stereo) in size reasoning.
- Assuming higher values are always better without recognizing costs.
Data Compression: Lossless, Lossy, and Tradeoffs
Digital media can get large quickly, so data compression reduces the number of bits needed to represent information. Compression is used everywhere: MP3, MP4, RAR, ZIP, JPG, and PNG files (among many others) involve compression.
Compression matters for saving disk space, reducing bandwidth when sending data over the Internet, and making backups/archives smaller.
Compression is a two-way process
Compression is a two-way process: a compression algorithm makes a data package smaller, and a decompression algorithm reverses that process to reconstruct the data.
More precisely, compression often takes a sequence of bytes and represents it with a shorter sequence, so the data takes less storage to keep or less bandwidth to transmit.
What compression really does (the core idea)
Compression takes advantage of patterns or redundancy. If something repeats, you may not need to store each repeated piece separately. For media such as images and audio, compression can also remove information that people are unlikely to notice.
A useful mental model is that compression changes the representation. With lossless methods the meaning and the exact data are preserved; with lossy methods the meaning you care about may be preserved, but the exact original bits are not.
Lossless compression (exact reconstruction)
Lossless algorithms can reconstruct the original message exactly from the compressed message. This is essential for cases where every bit matters, especially text compression and many program/scientific/financial files. Even tiny text differences can change meaning dramatically.
One simple lossless strategy is run-length encoding, which stores “value + count” for repeated sequences. For example:
- Original: AAAAABBBCC
- Compressed idea: A5 B3 C2
Not all data compresses well losslessly. Highly random-looking data may have few patterns to exploit.
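A minimal run-length encoding sketch in Python (the "value + count" format here is illustrative, not any real file format), showing that decompression reconstructs the original exactly:

```python
# Encode a string as a list of [character, run length] pairs.
def rle_encode(text):
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([ch, 1])  # start a new run
    return runs

# Decode by repeating each character its recorded number of times.
def rle_decode(runs):
    return "".join(ch * count for ch, count in runs)

runs = rle_encode("AAAAABBBCC")
print(runs)                              # [['A', 5], ['B', 3], ['C', 2]]
print(rle_decode(runs) == "AAAAABBBCC")  # True — lossless reconstruction
```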
Lossy compression (approximate reconstruction)
Lossy compression cannot reconstruct 100% of the original data after decompression. Lossy methods can achieve high degrees of compression and produce smaller files by permanently removing some information.
This loss is not “random missing pixels.” It is typically loss of information considered less important, such as removing certain frequency components in audio or removing details the human eye is less likely to notice.
Lossy compression is common for:
- photographs (JPEG)
- music/audio streams (MP3)
- video streaming
Heavy compression loss is often visible in photos when they are enlarged. In music, you can often hear a difference between an MP3 and a high-resolution audio file. For video, moving frames can often tolerate more loss of pixel-level detail than a single still image.
A common misconception is that “lossy means lower quality every time you open it.” The loss happens when you compress (and especially if you repeatedly recompress). Simply opening and closing a lossy file does not necessarily worsen it unless it is recompressed again.
Compression ratios and tradeoffs
A compression ratio describes how much smaller the compressed version is compared to the original. For AP CSP, you don’t need a single required formula, but you should be comfortable reasoning that a higher compression ratio means a smaller file, and that lossy methods can usually compress more than lossless methods.
Key tradeoffs:
- Storage vs quality: lossy compression reduces fidelity.
- Time vs space: compressing/decompressing takes computation time.
- Compatibility: systems must support the same format to read the data.
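One common (not AP-required) convention expresses the ratio as original size divided by compressed size, so a higher ratio means a smaller file. The sizes below are made up for illustration:

```python
original_bytes = 1000
compressed_bytes = 250

ratio = original_bytes / compressed_bytes
print(ratio)  # 4.0 — often written "4:1"

savings = 1 - compressed_bytes / original_bytes
print(savings)  # 0.75 — 75% of the space saved
```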
Why some data compresses better than others
Compression works best when data is predictable. Large areas of solid color in an image or repeated patterns in text compress well; a photo full of random noise is harder to compress.
Exam Focus
- Typical question patterns:
- Distinguish lossless vs lossy compression and identify when each is appropriate (text vs images/sound/video).
- Explain a tradeoff scenario (faster streaming vs reduced quality).
- Reason about why some files compress more effectively than others.
- Common mistakes:
- Saying lossy compression “doesn’t change the data” (it discards information).
- Assuming compression always reduces size by the same amount regardless of content.
- Forgetting that compression can add processing time even while saving storage/bandwidth.
Working with Data Sets: Collection, Cleaning, and Context
A data set is a collection of related data, often organized like a table with rows and columns. Data sets drive decisions in business, medicine, public policy, sports, and science, but conclusions are only as good as the data and the context.
Data, information, and knowledge
It helps to distinguish:
- Data: raw values (numbers, text, measurements, clicks, locations).
- Information: data processed/organized to be meaningful (summaries, trends, labeled charts).
- Knowledge: conclusions or decisions based on information (policy changes, diagnoses, redesigns).
Computers can process data at scale, but humans still define questions and interpret meaning.
How data is collected (and how bias can enter)
Data can be collected through surveys/forms, sensors (temperature, motion, GPS), transaction logs (purchases, website clicks), experiments, and user-generated content.
Collection methods can introduce bias. Common sources include:
- Sampling bias: the people/items measured are not representative.
- Measurement bias: the tool or method skews results (such as a poorly worded question).
- Survivorship bias: focusing only on visible “successes.”
Bias can lead to unfair or incorrect conclusions, especially when used to build algorithms that affect real people.
Data cleaning (what it is and why it matters)
Real data is messy. Data cleaning improves data quality before analysis and may include removing whitespace/symbol issues, fixing inconsistent formats (like “CA” vs “California”), removing duplicates, handling missing values, and identifying error-caused outliers.
Even a correct computation can produce misleading results if the input data is flawed.
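A small Python sketch of a few of these cleaning steps (the sample values and the lookup table are illustrative):

```python
# Messy input: stray whitespace, inconsistent names/case, duplicates, a missing value.
raw = ["  CA", "California", "ca", "NY", "NY", None]

# Normalization table mapping variants to a standard form (illustrative).
STATE_NAMES = {"ca": "CA", "california": "CA", "ny": "NY"}

cleaned = []
for entry in raw:
    if entry is None:
        continue                  # one simple way to handle missing values
    key = entry.strip().lower()   # fix whitespace and case issues
    cleaned.append(STATE_NAMES.get(key, key.upper()))

deduped = sorted(set(cleaned))    # remove duplicates
print(deduped)  # ['CA', 'NY']
```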
The importance of context and metadata
A number alone is often meaningless without context such as units, collection method, time period, and population. This is closely related to metadata (data describing other data). For a data set, metadata might include column definitions, units, and known limitations.
Without context, it’s easy to compare incorrectly (such as mixing Celsius and Fahrenheit).
Privacy and data collection
Collecting data about people raises privacy concerns. Even “non-sensitive” data like location patterns can reveal sensitive information. Combining data sources can enable re-identification, and data can be used in ways users didn’t expect. AP CSP often frames this as a tradeoff: data can be beneficial (for example, emergency services) but also misused.
Exam Focus
- Typical question patterns:
- Identify how bias or collection methods could affect a conclusion.
- Explain why metadata/context is needed to interpret a data set.
- Discuss privacy risks of collecting or combining data.
- Common mistakes:
- Treating a data set as automatically objective or unbiased.
- Ignoring how missing/incorrect data can change outcomes.
- Assuming that removing names always guarantees anonymity (other fields can still identify people).
Extracting Information from Data: Patterns, Visualizations, and Limits
Once you have a data set, the goal is to learn something from it. Extracting information means using computation and reasoning to find patterns, summarize trends, and support decisions.
From question to conclusion (a practical pipeline)
A strong investigation process is:
- Ask a question.
- Identify relevant data (what variables matter?).
- Clean/prepare data.
- Compute summaries or transformations.
- Visualize and interpret.
- Communicate conclusions and limitations.
Skipping cleaning or interpretation often produces misleading results.
Data extraction and transformation (common real-world workflow)
In many real systems, data extraction specifically means obtaining data from a database or software system (for example, a social media site) so it can be transported into tools like spreadsheets that support analytical processing.
A common workflow is:
- Extract the data (first step).
- Transform it (using filters or programs).
- Analyze it using graphs and other visualization tools.
Practical steps often include:
- analyzing what the data sources are (web pages, emails, chat logs, video files, audio files, text documents, customer messages)
- deciding what outcomes are needed (trend, effect, cause, quantity, etc.)
- choosing tools and repositories (like databases) to store and read the data
- cleaning issues like whitespace, stray symbols, and duplicates
- using visualization tools to understand patterns and "text flow"
Common ways to summarize data
Computers can compute summaries quickly, such as counts/frequencies, minimum/maximum, and measures of typical values like mean or median. Even when arithmetic isn’t required, you may need to pick an appropriate summary; for example, with extreme outliers, the median is often more representative than the mean.
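For example, with Python's `statistics` module and one extreme outlier in made-up income data:

```python
import statistics

incomes = [40, 45, 50, 55, 1000]   # one extreme outlier

print(statistics.mean(incomes))    # 238 — pulled far upward by the outlier
print(statistics.median(incomes))  # 50 — closer to a "typical" value
```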
Filtering, grouping, and trends
Many insights come from reorganizing or narrowing data:
- Filtering keeps only rows matching a condition.
- Grouping splits into categories (such as by city) for comparison.
- Time trends track change over time.
Programs excel here because they can process millions of rows reliably.
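These operations are short to express in code. A Python sketch with made-up rows:

```python
rows = [
    {"city": "Austin", "temp": 91},
    {"city": "Boston", "temp": 68},
    {"city": "Austin", "temp": 88},
]

# Filtering: keep only rows matching a condition.
hot = [r for r in rows if r["temp"] > 80]
print(len(hot))  # 2

# Grouping: count rows per category (here, per city).
counts = {}
for r in rows:
    counts[r["city"]] = counts.get(r["city"], 0) + 1
print(counts)  # {'Austin': 2, 'Boston': 1}
```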
Data visualization (why graphs help and how they can mislead)
A graph is a pictorial representation (diagram) used to represent data, often to show relationships. Visualizations help you spot trends, clusters, and outliers quickly and communicate results clearly.
However, graphs can mislead through poor design or missing context. Changing axis scales can exaggerate or hide differences, leaving out units or time ranges distorts meaning, and a visual correlation can tempt you into assuming causation.
How to read and analyze graphs (common types)
Different graphs/charts display data in different ways, and some are better suited for certain tasks.
- Graphs and charts commonly represent data using points, lines, bars, and areas; familiar examples include pie charts and scatter plots.
- Picture graphs use pictures to represent values.
- Bar graphs use vertical or horizontal bars to represent values.
- Line graphs use lines to represent values (often over time).
- Scatter plots represent data with individual points; a best-fit line is sometimes drawn through the points to show a trend.
Correlation vs causation
If two variables move together, they may be correlated, but correlation does not prove that one causes the other. Possible explanations include A causes B, B causes A, a third factor causes both, or coincidence (especially with small data sets). AP CSP frequently tests whether you can avoid the “correlation implies causation” trap.
Limits of data analysis
Even perfect computation cannot fix biased or unrepresentative data, missing context, ambiguous definitions, or ethical concerns about what should be measured or inferred. Large data sets can reduce random error, but they do not automatically guarantee truth.
Exam Focus
- Typical question patterns:
- Choose or interpret a visualization; explain what pattern it shows.
- Decide which operation (filter, count, group) would answer a question.
- Identify whether a claim is causal or merely correlational.
- Common mistakes:
- Assuming correlation proves causation.
- Ignoring how axis scaling or missing labels/units can mislead.
- Treating computed outputs as meaningful without checking data quality and context.
Using Programs with Data: Lists, Abstraction, and Processing at Scale
The increase in digitization of information, combined with huge numbers of transactions and interactions, has produced a flood of data and rapid growth in data volume. By analyzing large data sets, it is possible to identify connections across sources that seem unrelated and to find patterns that would be impossible to spot manually.
Data abstraction (managing complexity)
Data abstraction means representing complex real-world information in a simplified form that is useful for computation. This is not about “lying”; it’s choosing details that matter for your purpose.
Examples include a map representing roads as lines and intersections as nodes, a music app representing a song as samples (not physical air vibrations), or a student database representing a person with fields like name, grade level, and ID.
Lists as a core data structure
In AP CSP pseudocode, a list is an ordered collection of items. Lists support data at scale and enable algorithms like searching, filtering, and aggregation.
A key AP CSP detail is that list indexing on the exam reference sheet is typically 1-based (the first element is at index 1). Students used to 0-based languages often make off-by-one errors.
Traversals (processing each item)
A list traversal iterates over each element to compute a count, total, maximum, filtered list, and more. Conceptually, traversal turns raw data into information.
Example: counting items that meet a condition (AP-style pseudocode)
count ← 0
FOR EACH item IN dataList
{
IF (item > 50)
{
count ← count + 1
}
}
DISPLAY(count)
A common mistake is accidentally resetting the running total/count inside the loop so it never accumulates.
Finding a maximum or minimum
To find the largest value in a list, you typically start with the first element, compare each element to your current best, and update when you find something larger.
max ← dataList[1]
FOR EACH item IN dataList
{
IF (item > max)
{
max ← item
}
}
DISPLAY(max)
Using lists for paired or tabular data
Real data sets often have multiple attributes per record (name, age, city). Common representations include parallel lists (matched by index) or a list of records/rows (each item is itself a list). Structure affects how easy it is to analyze data correctly; parallel lists are error-prone if indices get misaligned.
Data transformations (turning data into a new form)
Programs often transform data to make it more useful, such as converting units (miles to kilometers), normalizing text (lowercasing), bucketing values into categories, or generating derived values (like average speed from distance and time). This connects directly to data abstraction: you are creating a representation that better matches your goal.
What can go wrong: hidden assumptions in code
Even correct-looking code can produce wrong conclusions if it assumes things that aren’t guaranteed, such as assuming no missing values exist, assuming units are consistent, or assuming a list is non-empty (max-finding code fails on an empty list).
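For example, here is a max-finding sketch in Python (function names illustrative) where the non-empty assumption is made explicit instead of hidden:

```python
# Mirrors the pseudocode above: assumes data has at least one element.
def unsafe_max(data):
    best = data[0]        # IndexError if data is empty — a hidden assumption
    for item in data:
        if item > best:
            best = item
    return best

# A guarded version that handles the empty case explicitly.
def safe_max(data):
    if len(data) == 0:
        return None       # or signal an error, depending on the goal
    return unsafe_max(data)

print(safe_max([3, 9, 4]))  # 9
print(safe_max([]))         # None
```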
Exam Focus
- Typical question patterns:
- Trace a list traversal and determine the final value of a variable (count, sum, max).
- Identify which code segment correctly filters or aggregates a list.
- Reason about the impact of 1-based indexing on which element is accessed.
- Common mistakes:
- Off-by-one errors from confusing 1-based and 0-based indexing.
- Updating the wrong variable inside a loop (overwriting instead of accumulating).
- Assuming properties of the data (non-empty, sorted, no missing values) that aren’t guaranteed.