How To Load Machine Learning Data From Files In Python

A common data format in machine learning is the CSV file (comma-separated values). In this tutorial I show 4 different ways to load the data from such files and then prepare it. I also show you some best practices for dealing with the correct data type, missing values, and an optional header. The 4 approaches are:

  • with the csv module
  • with numpy: np.loadtxt()
  • with numpy: np.genfromtxt()
  • with pandas: pd.read_csv()

The code and all Machine Learning tutorials can be found on GitHub.

import csv
import numpy as np
import pandas as pd

# download data from https://archive.ics.uci.edu/ml/datasets/spambase
FILE_NAME = "spambase.data"

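# 1) load with the csv module
# newline="" is what the csv documentation recommends when opening files for csv.reader
with open(FILE_NAME, "r", newline="") as f:
    data = list(csv.reader(f, delimiter=","))
# csv.reader returns every field as a string, so convert to float explicitly
data = np.array(data, dtype=np.float32)
print(data.shape)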

# 2) load with np.loadtxt()
# pass skiprows=1 if the file has a header row
data = np.loadtxt(FILE_NAME, delimiter=",", dtype=np.float32)
print(data.shape, data.dtype)

# 3) load with np.genfromtxt()
# optional arguments: skip_header=1 if the file has a header row,
# missing_values="---" and filling_values=0.0 to replace "---" entries with 0.0
data = np.genfromtxt(FILE_NAME, delimiter=",", dtype=np.float32)
print(data.shape)

# split into X and y (in spambase the label is the last column)
n_samples, n_features = data.shape
n_features -= 1  # the label column is not a feature
X = data[:, 0:n_features]
y = data[:, n_features]
print(X.shape, y.shape)
print(X[0, 0:5])
# or if y is the first column
# X = data[:, 1:n_features+1]
# y = data[:, 0]

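# 4) load with pandas: read_csv()
# header=None because spambase.data has no header row;
# pass na_values=["---"] to treat "---" entries as missing
df = pd.read_csv(FILE_NAME, header=None, skiprows=0, dtype=np.float32)
# replace missing values (NaN) with 0.0
df = df.fillna(0.0)
# convert the DataFrame to a NumPy array
data = df.to_numpy()
print(data[4, 0:5])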

# convert the data type of an existing NumPy array
# data = np.asarray(data, dtype=np.float32)
# print(data.dtype)
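
The options that are only mentioned in comments above (a header row and a missing-value marker) are easier to see on a tiny example. Here is a minimal sketch, assuming a made-up three-column sample with one header row and "---" standing in for missing values. The names CSV_TEXT, data_csv, data_np and data_pd are hypothetical and not part of the spambase code; the sample is read from memory via io.StringIO instead of from a file.

```python
import csv
import io

import numpy as np
import pandas as pd

# Hypothetical sample data: one header row, "---" marks a missing value
CSV_TEXT = """f1,f2,label
1.0,2.0,0
---,4.0,1
5.0,---,0
"""

# csv module: skip the header manually and replace "---" by hand
reader = csv.reader(io.StringIO(CSV_TEXT), delimiter=",")
header = next(reader)  # consume the header row
rows = [[0.0 if value == "---" else float(value) for value in row] for row in reader]
data_csv = np.array(rows, dtype=np.float32)

# np.genfromtxt(): skip one header line and fill "---" entries with 0.0
data_np = np.genfromtxt(
    io.StringIO(CSV_TEXT),
    delimiter=",",
    dtype=np.float32,
    skip_header=1,
    missing_values="---",
    filling_values=0.0,
)

# pd.read_csv(): header=0 uses the first row as column names,
# na_values turns "---" into NaN, and fillna() replaces NaN with 0.0
df = pd.read_csv(io.StringIO(CSV_TEXT), header=0, na_values=["---"])
df = df.fillna(0.0)
data_pd = df.to_numpy(dtype=np.float32)

print(header)
print(data_np.shape, data_np.dtype)
print(np.array_equal(data_csv, data_np), np.array_equal(data_np, data_pd))
```

All three approaches should end up with the same float32 array here. Filling with 0.0 is just the choice used in this tutorial; depending on the dataset, another fill value or dropping the affected rows may be more appropriate.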
