Arrays , Strings and Hashing - (Theory)
Comprehensive Guide: Arrays, Strings, and Hashing
ARRAYS
Fundamental Concepts
Definition and Internal Structure
An array is a homogeneous data structure that stores elements of the same type in a linear sequence. In low-level languages like C/C++, arrays are stored in contiguous memory locations, enabling efficient cache utilization and arithmetic-based indexing. However, in Java, arrays are reference types stored in the heap memory, which may not guarantee physical contiguity due to garbage collection and memory fragmentation.
Memory Model Deep Dive
In Java's memory model:
- Array objects reside in the heap
- The array reference (variable) is stored in the stack
- Each array element is stored as a continuous block in the heap
- Multi-dimensional arrays are arrays of references, creating a tree-like structure in memory
int[] singleArray = new int[1000]; // Single contiguous block
int[][] matrix = new int[100][100]; // 100 separate arrays referenced by one array
Physical vs Logical Organization
While logically arrays appear contiguous, physically in Java:
- Single-dimensional arrays maintain locality
- Multi-dimensional arrays break locality due to reference indirection
- This affects cache performance in performance-critical applications
Static vs Dynamic Arrays
Static Arrays
- Fixed size determined at declaration
- Memory allocated once during creation
- Cannot resize during runtime
- Direct memory access patterns
- Example: Traditional C arrays, Java primitive arrays
Dynamic Arrays
- Size can change during runtime
- Implemented using resizing strategies (typically doubling)
- Amortized constant time for append operations
- Example: ArrayList, Vector, Array in JavaScript
Resizing Strategy Deep Dive
Dynamic arrays use growth factors:
- Factor of 2: Doubles size when full, provides amortized O(1) insertion
- Golden Ratio (1.618): More memory efficient but complex
- Factor of 1.5: Used by some implementations, balances memory and performance
Multi-Dimensional Arrays
2D Arrays Implementation
In Java, 2D arrays are implemented as arrays of arrays:
int[][] matrix = new int[rows][cols];
// Equivalent to:
int[][] matrix = new int[rows][];
for(int i = 0; i < rows; i++) {
matrix[i] = new int[cols];
}
Jagged Arrays
Java supports jagged arrays where each row can have different lengths:
int[][] jaggedArray = new int[3][];
jaggedArray[0] = new int[2]; // First row has 2 elements
jaggedArray[1] = new int[5]; // Second row has 5 elements
jaggedArray[2] = new int[3]; // Third row has 3 elements
N-Dimensional Arrays
For higher dimensions, the reference chain becomes deeper:
- 3D: Array of arrays of arrays
- 4D: Array of arrays of arrays of arrays
- Performance degrades with dimension increase due to cache misses
Array Operations with Complexity Analysis
Insertion Operations
- At End: O(1) if space available, O(n) if resizing needed
- At Beginning: O(n) - requires shifting all elements
- At Middle: O(n) - requires shifting elements after insertion point
- In Sorted Array: O(n) - requires finding position + shifting
Deletion Operations
- From End: O(1) - simply decrement size
- From Beginning: O(n) - shift all elements left
- From Middle: O(n) - shift elements after deletion point
- By Value: O(n) - search + deletion
Update Operations
- Random Access: O(1) - direct indexing
- Sequential Update: O(n) - must visit each element
Traversal Patterns
- Forward: Standard O(n) iteration
- Backward: O(n) with reverse indexing
- Skip Patterns: O(n/k) for every kth element
Practical Scenarios for Arrays
Scenario 1: Image Processing
Arrays are fundamental in image processing where images are represented as 2D or 3D arrays:
// RGB image representation
int[][][] image = new int[height][width][3]; // height x width x RGB
// Processing pixels requires efficient random access
for(int y = 0; y < height; y++) {
for(int x = 0; x < width; x++) {
// Apply filter to pixel at (x,y)
int red = image[y][x][0];
int green = image[y][x][1];
int blue = image[y][x][2];
}
}
Scenario 2: Matrix Operations in Scientific Computing
Mathematical computations heavily rely on multi-dimensional arrays:
// Matrix multiplication
double[][] multiply(double[][] A, double[][] B) {
int n = A.length, m = B[0].length, p = B.length;
double[][] C = new double[n][m];
for(int i = 0; i < n; i++) {
for(int j = 0; j < m; j++) {
for(int k = 0; k < p; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
return C;
}
Scenario 3: Game Development - Grid-Based Games
2D arrays naturally represent game boards, maps, or grids:
// Chess board representation
enum Piece { PAWN, ROOK, KNIGHT, BISHOP, QUEEN, KING, EMPTY }
Piece[][] chessBoard = new Piece[8][8];
// Pathfinding in grid maps
boolean[][] obstacles = new boolean[mapHeight][mapWidth];
int[][] distances = new int[mapHeight][mapWidth];
Scenario 4: Time Series Data Analysis
Arrays store sequential data for analysis:
// Stock price analysis
double[] prices = new double[365]; // Daily prices for a year
double[] movingAverage = calculateMovingAverage(prices, windowSize);
// Signal processing
double[] signal = new double[sampleRate * duration];
double[] filteredSignal = applyLowPassFilter(signal);
Scenario 5: Buffer Management in Systems Programming
Arrays serve as buffers for I/O operations:
// Network packet buffer
byte[] buffer = new byte[1024];
int bytesRead = inputStream.read(buffer);
// Circular buffer for real-time data
class CircularBuffer {
private int[] buffer;
private int head, tail, size;
public void enqueue(int value) {
buffer[tail] = value;
tail = (tail + 1) % buffer.length;
if(size < buffer.length) size++;
else head = (head + 1) % buffer.length;
}
}
STRINGS
Immutability and Memory Implications
Why Immutability?
Java strings are immutable for several critical reasons:
- Security: Strings often hold sensitive data (passwords, file paths)
- Thread Safety: Multiple threads can safely share string references
- String Pool Optimization: Identical strings share memory
- Hash Code Caching: Hash codes can be computed once and cached
Memory Model for Strings
String s1 = "Hello"; // Stored in String Pool (Method Area)
String s2 = "Hello"; // Points to same object in pool
String s3 = new String("Hello"); // Creates new object in heap
String Pool Mechanism
The String Pool (part of the Method Area) stores unique string literals:
- Reduces memory footprint
- Enables fast equality checks via reference comparison
- Automatic deduplication of string literals
Mutable String Alternatives
StringBuilder vs StringBuffer
| Aspect | StringBuilder | StringBuffer |
|---|---|---|
| Thread Safety | Not synchronized | Synchronized |
| Performance | Faster | Slower due to synchronization |
| Use Case | Single-threaded | Multi-threaded |
| Memory Overhead | Lower | Higher |
Internal Implementation
Both use resizable character arrays:
// Simplified StringBuilder implementation
public class StringBuilder {
private char[] value;
private int count;
public StringBuilder append(String str) {
int len = str.length();
ensureCapacity(count + len);
str.getChars(0, len, value, count);
count += len;
return this;
}
private void ensureCapacity(int minimumCapacity) {
if (minimumCapacity > value.length) {
int newCapacity = value.length * 2 + 2;
value = Arrays.copyOf(value, newCapacity);
}
}
}
String Operations and Complexity
Concatenation Analysis
// Inefficient - O(n²) overall complexity
String result = "";
for(int i = 0; i < n; i++) {
result += "a"; // Creates new string each time
}
// Efficient - O(n) overall complexity
StringBuilder sb = new StringBuilder();
for(int i = 0; i < n; i++) {
sb.append("a"); // Amortized O(1) per operation
}
Substring Operations
In modern Java (7+), substring creates a new character array:
- Time: O(n) where n is substring length
- Space: O(n) for the new string
- No memory leaks unlike older versions
Character Access and Traversal
// Random access - O(1)
char c = str.charAt(index);
// Sequential traversal - O(n)
for(char c : str.toCharArray()) {
// process character
}
// Enhanced for loop - O(n) but with method call overhead
for(int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
}
Practical Scenarios for Strings
Scenario 1: Text Processing and Natural Language Processing
String manipulation is crucial in text analysis:
// Text preprocessing for NLP
public String preprocessText(String text) {
StringBuilder processed = new StringBuilder();
// Convert to lowercase, remove punctuation, normalize whitespace
for(char c : text.toCharArray()) {
if(Character.isLetterOrDigit(c)) {
processed.append(Character.toLowerCase(c));
} else if(Character.isWhitespace(c) &&
processed.length() > 0 &&
processed.charAt(processed.length()-1) != ' ') {
processed.append(' ');
}
}
return processed.toString().trim();
}
Scenario 2: Log Processing and Analysis
String operations are essential for parsing log files:
// Log entry parsing
public class LogParser {
public LogEntry parseLine(String line) {
// Format: [2023-12-01 10:30:45] ERROR: Database connection failed
int timestampEnd = line.indexOf(']');
String timestamp = line.substring(1, timestampEnd);
int levelStart = timestampEnd + 2;
int levelEnd = line.indexOf(':', levelStart);
String level = line.substring(levelStart, levelEnd);
String message = line.substring(levelEnd + 2);
return new LogEntry(timestamp, level, message);
}
}
Scenario 3: Web Development - URL and Query String Processing
String manipulation in web applications:
// URL parameter extraction
public Map<String, String> parseQueryString(String queryString) {
Map<String, String> params = new HashMap<>();
if(queryString == null || queryString.isEmpty()) {
return params;
}
String[] pairs = queryString.split("&");
for(String pair : pairs) {
String[] keyValue = pair.split("=", 2);
if(keyValue.length == 2) {
String key = URLDecoder.decode(keyValue[0], "UTF-8");
String value = URLDecoder.decode(keyValue[1], "UTF-8");
params.put(key, value);
}
}
return params;
}
Scenario 4: Configuration File Processing
Parsing configuration files and property files:
// Properties file parser
public class ConfigParser {
public Properties parseConfig(String configText) {
Properties props = new Properties();
StringBuilder currentValue = new StringBuilder();
String currentKey = null;
String[] lines = configText.split("\n");
for(String line : lines) {
line = line.trim();
if(line.isEmpty() || line.startsWith("#")) {
continue; // Skip empty lines and comments
}
if(line.contains("=")) {
if(currentKey != null) {
props.setProperty(currentKey, currentValue.toString());
}
String[] parts = line.split("=", 2);
currentKey = parts[0].trim();
currentValue = new StringBuilder(parts[1].trim());
} else if(currentKey != null) {
// Continuation line
currentValue.append(" ").append(line);
}
}
if(currentKey != null) {
props.setProperty(currentKey, currentValue.toString());
}
return props;
}
}
Scenario 5: Template Engine Implementation
String replacement and template processing:
// Simple template engine
public class TemplateEngine {
public String processTemplate(String template, Map<String, String> variables) {
StringBuilder result = new StringBuilder();
int pos = 0;
while(pos < template.length()) {
int start = template.indexOf("${", pos);
if(start == -1) {
result.append(template.substring(pos));
break;
}
result.append(template.substring(pos, start));
int end = template.indexOf("}", start);
if(end == -1) {
throw new IllegalArgumentException("Unclosed variable: " + template.substring(start));
}
String varName = template.substring(start + 2, end);
String value = variables.getOrDefault(varName, "${" + varName + "}");
result.append(value);
pos = end + 1;
}
return result.toString();
}
}
HASHING
Hash Function Design and Theory
Mathematical Foundation
A good hash function should exhibit these properties:
- Deterministic: Same input always produces same output
- Uniform Distribution: Hash values spread evenly across the hash table
- Avalanche Effect: Small input changes cause large hash changes
- Efficiency: Fast to compute
Polynomial Rolling Hash for Strings
The polynomial rolling hash is fundamental in string hashing:
hash(s) = (s[0] * p^0 + s[1] * p^1 + ... + s[n-1] * p^(n-1)) mod m
Where:
pis a prime number (commonly 31, 53, or larger primes)mis a large prime number to reduce collisions- Powers of
pcreate position sensitivity
Why Prime Numbers in Hashing?
Prime numbers are chosen because:
- Reduced Patterns: Primes break up regular patterns in data
- Mathematical Properties: Better distribution due to number theory
- Collision Reduction: Prime modulus reduces clustering
- Bit Manipulation: 31 allows optimization:
31 * x = (x << 5) - x
Java's String Hash Implementation
public int hashCode() {
int h = hash; // Check if already computed
if (h == 0 && value.length > 0) {
char val[] = value;
for (int i = 0; i < value.length; i++) {
h = 31 * h + val[i];
}
hash = h; // Cache the result
}
return h;
}
This implementation:
- Uses 31 as the polynomial base
- Caches the result for immutable strings
- Processes characters left-to-right
- Allows hash code computation in O(n) time
Collision Resolution Strategies
Separate Chaining (Java's Approach)
Each hash table slot contains a bucket (linked list or tree):
// Simplified HashMap bucket structure
class HashMapEntry<K,V> {
final K key;
V value;
HashMapEntry<K,V> next;
final int hash;
HashMapEntry(int hash, K key, V value, HashMapEntry<K,V> next) {
this.hash = hash;
this.key = key;
this.value = value;
this.next = next;
}
}
Tree Conversion in Java 8+
When chain length exceeds threshold (8), Java converts to balanced tree:
- Worst-case search becomes O(log n) instead of O(n)
- Requires comparable keys or custom comparison logic
- Converts back to list when size drops below threshold (6)
Open Addressing Alternatives
- Linear Probing: Check next slot sequentially
- Quadratic Probing: Check slots at quadratic intervals
- Double Hashing: Use second hash function for step size
Load Factor and Resizing
Load Factor Significance
Load Factor = Number of Elements / Table Size
Java's default load factor of 0.75 balances:
- Space: Not too much wasted space
- Time: Reasonable collision probability
- Resize Frequency: Not too frequent resizing
Resizing Process
When load factor exceeds threshold:
- Create new table (typically double size)
- Rehash all existing elements
- Update table reference
- Time complexity: O(n) for rehashing
HashMap vs TreeMap Comparison
HashMap Characteristics
- Average: O(1) for basic operations
- Worst Case: O(n) if many collisions
- Memory: Lower overhead per entry
- Ordering: No guaranteed order
TreeMap Characteristics
- Guaranteed: O(log n) for all operations
- Ordering: Maintains sorted order
- Memory: Higher overhead (tree pointers)
- Requirements: Keys must be comparable
Practical Scenarios for Hashing
Scenario 1: Caching and Memoization
Hash maps excel at caching computed results:
// Fibonacci with memoization
public class FibonacciCache {
private Map<Integer, Long> cache = new HashMap<>();
public long fibonacci(int n) {
if(n <= 1) return n;
if(cache.containsKey(n)) {
return cache.get(n);
}
long result = fibonacci(n-1) + fibonacci(n-2);
cache.put(n, result);
return result;
}
}
// Web request caching
public class WebCache {
private Map<String, CacheEntry> cache = new HashMap<>();
private static final long CACHE_DURATION = 5 * 60 * 1000; // 5 minutes
public String getCachedResponse(String url) {
CacheEntry entry = cache.get(url);
if(entry != null && System.currentTimeMillis() - entry.timestamp < CACHE_DURATION) {
return entry.response;
}
String response = fetchFromServer(url);
cache.put(url, new CacheEntry(response, System.currentTimeMillis()));
return response;
}
}
Scenario 2: Database Indexing and Query Optimization
Hash indexes provide fast lookups:
// In-memory database index
public class DatabaseIndex {
private Map<Object, Set<Integer>> index = new HashMap<>();
public void addToIndex(Object key, int rowId) {
index.computeIfAbsent(key, k -> new HashSet<>()).add(rowId);
}
public Set<Integer> findRows(Object key) {
return index.getOrDefault(key, Collections.emptySet());
}
// Compound index for multiple columns
public class CompoundKey {
private final Object[] keys;
public CompoundKey(Object... keys) {
this.keys = keys.clone();
}
@Override
public boolean equals(Object obj) {
if(this == obj) return true;
if(!(obj instanceof CompoundKey)) return false;
return Arrays.equals(keys, ((CompoundKey) obj).keys);
}
@Override
public int hashCode() {
return Arrays.hashCode(keys);
}
}
}
Scenario 3: Distributed Systems and Consistent Hashing
Hash functions distribute data across multiple servers:
// Consistent hashing for distributed cache
public class ConsistentHashRing {
private TreeMap<Integer, String> ring = new TreeMap<>();
private int virtualNodes = 100;
public void addNode(String node) {
for(int i = 0; i < virtualNodes; i++) {
String virtualNode = node + ":" + i;
int hash = hashFunction(virtualNode);
ring.put(hash, node);
}
}
public String getNode(String key) {
if(ring.isEmpty()) return null;
int hash = hashFunction(key);
Map.Entry<Integer, String> entry = ring.ceilingEntry(hash);
if(entry == null) {
entry = ring.firstEntry(); // Wrap around
}
return entry.getValue();
}
private int hashFunction(String input) {
return input.hashCode();
}
}
Scenario 4: Real-time Analytics and Event Processing
Hash maps track metrics and counters:
// Real-time metrics aggregation
public class MetricsAggregator {
private Map<String, AtomicLong> counters = new ConcurrentHashMap<>();
private Map<String, DescriptiveStatistics> metrics = new ConcurrentHashMap<>();
public void incrementCounter(String name) {
counters.computeIfAbsent(name, k -> new AtomicLong()).incrementAndGet();
}
public void recordMetric(String name, double value) {
metrics.computeIfAbsent(name, k -> new DescriptiveStatistics()).addValue(value);
}
public Map<String, Double> getAverages() {
return metrics.entrySet().stream()
.collect(Collectors.toMap(
Map.Entry::getKey,
e -> e.getValue().getMean()
));
}
}
// Session tracking for web applications
public class SessionManager {
private Map<String, UserSession> sessions = new ConcurrentHashMap<>();
private static final long SESSION_TIMEOUT = 30 * 60 * 1000; // 30 minutes
public UserSession getSession(String sessionId) {
UserSession session = sessions.get(sessionId);
if(session != null && session.isExpired()) {
sessions.remove(sessionId);
return null;
}
return session;
}
public void createSession(String sessionId, String userId) {
sessions.put(sessionId, new UserSession(userId, System.currentTimeMillis()));
}
public void cleanupExpiredSessions() {
long now = System.currentTimeMillis();
sessions.entrySet().removeIf(entry ->
now - entry.getValue().getLastAccess() > SESSION_TIMEOUT
);
}
}
Scenario 5: Duplicate Detection and Data Deduplication
Hash sets efficiently identify duplicates:
// File deduplication using content hashing
public class FileDeduplicator {
private Map<String, String> hashToPath = new HashMap<>();
private Set<String> processedHashes = new HashSet<>();
public void processFile(String filePath) throws IOException {
String contentHash = calculateFileHash(filePath);
if(processedHashes.contains(contentHash)) {
String existingPath = hashToPath.get(contentHash);
System.out.println("Duplicate found: " + filePath + " matches " + existingPath);
// Could create hard link or symbolic link here
} else {
processedHashes.add(contentHash);
hashToPath.put(contentHash, filePath);
}
}
private String calculateFileHash(String filePath) throws IOException {
MessageDigest md = MessageDigest.getInstance("SHA-256");
try(FileInputStream fis = new FileInputStream(filePath)) {
byte[] buffer = new byte[8192];
int bytesRead;
while((bytesRead = fis.read(buffer)) != -1) {
md.update(buffer, 0, bytesRead);
}
}
return bytesToHex(md.digest());
}
}
// Email address validation and normalization
public class EmailProcessor {
private Set<String> processedEmails = new HashSet<>();
private Map<String, String> normalizedToOriginal = new HashMap<>();
public boolean addEmail(String email) {
String normalized = normalizeEmail(email);
if(processedEmails.contains(normalized)) {
return false; // Duplicate
}
processedEmails.add(normalized);
normalizedToOriginal.put(normalized, email);
return true;
}
private String normalizeEmail(String email) {
// Convert to lowercase, remove dots from local part for Gmail, etc.
String[] parts = email.toLowerCase().split("@");
if(parts.length != 2) {
throw new IllegalArgumentException("Invalid email format");
}
String localPart = parts[0];
String domain = parts[1];
// Gmail-specific normalization
if(domain.equals("gmail.com") || domain.equals("googlemail.com")) {
localPart = localPart.replace(".", "");
int plusIndex = localPart.indexOf('+');
if(plusIndex != -1) {
localPart = localPart.substring(0, plusIndex);
}
domain = "gmail.com"; // Normalize googlemail to gmail
}
return localPart + "@" + domain;
}
}
Performance Comparison Summary
| Operation | Array | String | StringBuilder | HashMap | HashSet | TreeMap | TreeSet |
|---|---|---|---|---|---|---|---|
| Insertion (end) | O(1) | O(n) | O(1) amortized | O(1) average | O(1) average | O(log n) | O(log n) |
| Insertion (middle) | O(n) | O(n) | O(n) | N/A | N/A | N/A | N/A |
| Deletion | O(n) | O(n) | O(n) | O(1) average | O(1) average | O(log n) | O(log n) |
| Update | O(1) | N/A (immutable) | O(1) | O(1) average | N/A | O(log n) | N/A |
| Search/Access | O(1) random, O(n) search | O(1) by index | O(1) by index | O(1) average | O(1) average | O(log n) | O(log n) |
| Traversal | O(n) | O(n) | O(n) | O(n) | O(n) | O(n) | O(n) |
| Space Complexity | O(n) | O(n) | O(n) | O(n) | O(n) | O(n) | O(n) |
Key Takeaways
When to Use Arrays:
- Fixed-size data with frequent random access
- Mathematical computations and matrix operations
- Performance-critical applications requiring cache locality
- Implementing other data structures
When to Use Strings:
- Text processing and manipulation
- Immutable data that benefits from sharing
- Configuration and template processing
- Inter-system communication protocols
When to Use HashMaps/HashSets:
- Fast lookups and duplicate detection
- Caching and memoization
- Counting and frequency analysis
- Database indexing (non-ordered)
When to Use TreeMaps/TreeSets:
- Ordered data requirements
- Range queries and sorted iteration
- Maintaining sorted collections
- Database indexing (ordered)