Questions tagged [apache-spark]

Use this tag for questions related to the distributed computing framework published by Apache.

19 questions
2
votes
1 answer

Can we use Cassandra in place of Hadoop with Spark?

Considering we have a backend written in NodeJS that uses MySQL and Cassandra as its databases, if we want to add Spark to the system to do some data analysis such as recommendations, can we do it with Cassandra (I mean using Spark + Cassandra)…
2
votes
1 answer

Best practices for large JOINs - Warehouse or External Compute (e.g. Spark)

I am working on a problem that requires a very large join. The JOIN itself is pretty straightforward but the amount of data I am processing is very large. I am wondering for very large JOINs, is there a preferred type of technology. For example, is…
Arthur Putnam
  • 553
  • 2
  • 6
  • 12
2
votes
2 answers

Check the time that PostgreSQL is taking to automatically create existing indexes when you do bulk insert using copy command from Spark

I am doing a bulk insert from Spark into a Postgres table. The amount of data that I am ingesting is huge: the number of records is around 120-130 million. I am first saving the records as multiple CSV files in a distributed storage location, i.e. S3…
1
vote
0 answers

Writing large dataset from spark dataframe

We have an Azure Databricks job that retrieves a large dataset with PySpark. The dataframe has about 11 billion rows. We are currently writing this out to a PostgreSQL DB (also in Azure). Currently we are using the JDBC connector to write row out…
1
vote
0 answers

Why does MySQL select offset not work in pyspark?

select distinct day from t1 order by 1 desc limit 1,1 should return the second latest date in t1, but this error is thrown: ParseException: mismatched input ',' expecting {<EOF>, ';'}
1
vote
1 answer

Optimal join for joining facts with scd-type-2 dimension for aggregation/reporting

I have a fact table and an SCD type 2 dimension table. I want to produce a sales report by region and year. I have a working solution with a query that joins them for reporting purposes. When I run the query in Spark/Databricks, it gives me a little…
1
vote
2 answers

When will the Spark Cassandra connector with support for Spark 3.3 be released?

The master branch has a merged pull request supporting Spark 3.3. How long before this gets built and published to the Maven repository?
Rubber Duck
  • 111
  • 3
1
vote
2 answers

Cassandra-4-Update: Multiple Schema Versions

After upgrading our first node, it has a different schema version (according to nodetool describecluster). This caused Spark jobs to hang because of recurring "schema agreement not reached" messages from metadata.SchemaAgreementChecker. Is this different…
Sven
  • 11
  • 2
1
vote
0 answers

MySQL is not showing the history queries after the ETL is complete

I am using Docker containers for MySQL and Spark. Both containers are on an AWS EC2 instance. The PySpark ETL connects to MySQL with JDBC and starts extracting the data. When the ETL is running I can see the long-running threads/queries in MySQL with command…
nomad123
  • 19
  • 1
  • 3
1
vote
0 answers

Pyspark error inserting into mysql database table that exists

I am trying to insert into an existing MySQL table using a PySpark JDBC connection, but I get the following error (picture attached). Can I get assistance with this error? The table exists in the MySQL database, and I was successful in inserting with a…
IIShriyaII
  • 11
  • 3
1
vote
0 answers

HIVE + understanding the hive-metastore logs

We have an HDP cluster, version 2.6.4, and we run a Spark streaming app and use a Presto cluster to run Hive queries. When we look at the hivemetastore logs (under /var/log/hive), we can see the following warnings, which repeat…
King David
  • 111
  • 1
  • 4
0
votes
0 answers

Py4JJavaError while creating SparkSession

I encountered this error while trying to create a Spark session in a Python virtual environment using VS Code as the IDE. The code I ran and the output are below; please help. Code spark = SparkSession.builder\ .master("local[1]")\ …
tomiealff
  • 1
  • 2
0
votes
1 answer

spark-cassandra-connector read throughput unpredictable

A user reports that the range query throughput is far higher than expected when setting spark.cassandra.input.readsPerSec in the spark-cassandra-connector. Job dependencies. The Java driver version is set to 4.13.0.
Paul
  • 416
  • 2
  • 6
0
votes
0 answers

Is there an open source implementation of QGM (Query Graph Model)?

I am building a new system that needs to interact essentially as an SQL backend. We would like to import logical queries into it (e.g. from ApacheSPARQ or Postgres and related things) and want to develop an internal representation (IR) for them. …
intel_chris
  • 141
  • 3
0
votes
3 answers

Data storage for analytics

I have to store some amount of data for analytical purposes. The data source produces 2TB data per month. Data is collected on a monthly basis (not real-time). Data is fully structured. There are 100+ different columns of data. Availability of SQL…
Leeloo
  • 111
  • 5